Abstract

An analysis of a general Discrete Event Simulation (DES), executing on the distributed
architecture of an eight node Intel iPSC/2 hypercube, was performed. The most
time consuming portions of the general DES algorithm were determined to be the functions
associated with message passing of required simulation data between processing nodes of
the hypercube architecture. A behavioral description, using the IEEE standard VHSIC
Hardware Description and Design Language (VHDL), for a general DES hardware accelerator
is presented. The behavioral description specifies the operational requirements for a
DES coprocessor to augment the hypercube’s execution of DES simulations. The DES
coprocessor design implements the functions necessary to perform distributed discrete event
simulations using a conservative time synchronization protocol.




Requirements  Analysis  for  a  Hardware,  Discrete-Event, 
Simulation  Engine  Accelerator 


I.  Introduction 

1.1  Background 

Computer  simulations  are  used  in  a  broad  range  of  diverse  applications  such  as 
engineering,  medicine,  social  sciences,  and  the  military.  Traditionally,  simulations  were 
designed  for  and  executed  on  sequential  processors.  However,  dramatic  increases  in  the 
size  and  complexity  of  simulations  over  the  past  20  years  have  resulted  in  simulation  models 
“whose  computational  requirements  cannot  be  reasonably  satisfied  with  even  the  fastest 
sequential  processors”  (28:8). 

The  design  of  electronic  hardware  is  one  area  where  the  increased  complexity  of  sim¬ 
ulation  models  is  very  evident.  The  rapid  growth  in  component  to  chip  densities  requires 
simulation  of  ever  larger  circuits.  Since  1960  the  circuit  to  chip  ratio  has  nearly  doubled 
every  year,  resulting  in  densities  greater  than  500,000  transistors  per  chip  (12:449).  Al¬ 
though  this  growth  rate  has  slowed  to  a  doubling  about  every  two  years,  the  required  logic 
simulation  has  become  a  major  limitation  in  the  overall  design  process. 

The Air Force has a large investment in electronic hardware, and the development
costs continue to increase as the hardware becomes more complex. The Air Force’s
increased reliance on electronic hardware has contributed significantly to the Department of
Defense’s Very High Speed Integrated Circuit (VHSIC) program. A primary objective of
the VHSIC program is to develop and promote the use of high-density integrated circuits
in military systems.

VHSIC  technology  is  heavily  dependent  on  simulation  for  the  design  and  verification 
of  these  complex  electronic  components.  Logic  verification  and  fault  analysis  are  essential 
in  the  design  of  VHSIC  chips  and  must  be  performed  extensively  before  chip  fabrication. 
This  complex  testing,  done  through  simulation,  often  consumes  months  of  computer  time 
and  has  become  a  bottleneck  in  the  logic  design  process  (12:449). 




The VHSIC Hardware Description Language (VHDL) program was started in 1983
to standardize the tools needed to design, test, and document large-scale digital
electronics more efficiently. Initial implementations of VHDL were developed by Intermetrics, Inc.
under a DoD contract in 1983. Evolution and improvements in the language led to the
IEEE Standard VHDL Language Reference Manual in 1987. VHDL has become important
enough in recent years that the Department of Defense Advanced Research Projects Agency
(DARPA) has sponsored the QUEST project. One objective of the QUEST project is
simulation acceleration; specifically, a thousand-fold speedup in VHDL simulations of VHSIC
designs is desired.

1.2  Problem 

The  limitations  of  traditional  sequential  processors  have  increased  research  in  the 
area  of  applying  parallel  computer  architectures  and  multiprocessor  technology  to  meet  the 
computational  requirements  of  large  simulations.  Theoretically,  if  a  sequential  simulation 
is  logically  partitioned  into  separate  processes,  placed  on  separate  processors  and  run  in 
parallel,  the  amount  of  speedup  attainable  should  be  equal  to  the  number  of  processors 
used. 

The  theoretical  speedup  possible  through  parallel,  or  as  it  is  more  commonly  known, 
distributed  simulation  has  yet  to  be  realized.  Several  obstacles  inherent  to  distributed 
processing  must  be  minimized  to  approach  the  theoretical  speedup.  Among  these  obstacles 
are:  the  communications  overhead  associated  with  the  necessary  exchange  of  information 
between  logical  processes;  the  load  imbalance  related  to  the  static  allocation  of  logical 
processes  to  processors;  and  the  synchronization  delay  necessary  to  ensure  event-driven 
simulations  do  not  process  events  out  of  order. 

This  thesis  investigates  possible  enhancements  to  the  discrete-event  distributed  sim¬ 
ulation  process  that  can  be  realized  through  a  hardware  implementation.  The  purpose  of 
this  research  is  to  specify  the  detailed  requirements  of  such  an  implementation. 

1.3  Summary of Current Knowledge

Simulation models are classified by Pritsker as either discrete, continuous, or
combined. The basis for this classification is how the dependent variables of the simulation
model change with respect to time. In discrete simulation the dependent variables change
at specified points in simulated time known as event times and generally do not change
values between event times. Discrete simulation is further classified by the relationship
between events, activities, and processes. Continuous simulation results when the dependent
variables of the simulation model change continuously over simulated time. Combined
simulation occurs when dependent variables change discretely, continuously, or a combination
of both (26:63-64).

A time-based classification for simulation is also proposed by Neelamkavil. In this
classification, time can be advanced in two ways. The first is a synchronous,
interval-oriented simulation, where time is advanced from time t to t + Δt in uniform fixed
increments of Δt. The second method, event-oriented simulation, is asynchronous and time
may advance in variable intervals. Using this approach, time is “incremented from time t
to the next event time t', whatever the value of t'” (24:136).
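
The difference between the two time-advance methods can be illustrated with a minimal Python sketch. The function names and the sample event times are hypothetical and chosen only for illustration; the sketch simply counts how many clock updates each method needs to reach the same set of events.

```python
def interval_oriented(event_times, dt, t_end):
    """Advance the clock from t to t + dt in fixed increments of dt.

    Returns (number of clock updates, events fired in order).
    """
    pending = sorted(event_times)
    fired = []
    updates = 0
    step = 0
    while step * dt < t_end:
        step += 1
        t = step * dt          # synchronous advance: t -> t + dt
        updates += 1
        while pending and pending[0] <= t:
            fired.append(pending.pop(0))
    return updates, fired


def event_oriented(event_times, t_end):
    """Advance the clock directly to the next event time t'."""
    fired = []
    updates = 0
    for t_next in sorted(event_times):
        if t_next > t_end:
            break
        updates += 1           # asynchronous advance: t -> t'
        fired.append(t_next)
    return updates, fired


# Hypothetical workload: three events in 50 units of simulated time.
events = [3, 10, 42]
fixed_updates, _ = interval_oriented(events, dt=1, t_end=50)
event_updates, _ = event_oriented(events, t_end=50)
```

With this sparse workload the interval-oriented clock is updated once per increment regardless of activity, while the event-oriented clock is updated once per event.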

The emphasis of current research is on discrete event simulation. This approach is
well-suited  to  digital  logic  simulation  where  only  a  small  portion  of  the  circuit,  typically 
10-15  percent,  is  active  at  a  given  time  (9:67).  Hence,  the  inefficiency  of  simulating  every 
element  in  a  circuit,  when  only  a  fraction  are  switching,  is  avoided. 

Efforts  to  improve  the  performance  of  logic  simulation  fall  into  two  categories.  The 
first  is  a  top  down  approach  of  divide  and  conquer.  That  is,  divide  the  circuit  into  smaller, 
more  manageable  modules  for  which  the  simulation  costs  are  not  so  severe.  This  approach 
is  often  plagued  by  difficulties  in  providing  effective  tests  for  the  interfaces  between  modules 
(12:449). 

The second approach is to optimize the performance of the simulation itself through
various speed-up techniques. One avenue considered in this approach is to identify those
portions of the simulation software that occur frequently and are time consuming to
execute. According to Wong the operations to consider for recoding are event-list
manipulation, function evaluation, and net-list searching, as they account for 85% of the execution
time of a logic simulation algorithm (31:47). Recoding of this software attempts to improve
efficiency through the use of hand optimized assembly language. Unfortunately, algorithm
optimization seldom yields more than a threefold increase in speed (4:130).

Another  prevalent  approach  to  accelerating  simulation  is  the  use  of  special  purpose 
hardware  and  digital  computers  tailored  to  logic  simulation.  Special  purpose  computers 
can  have  performance  orders  of  magnitude  faster  than  the  current  software  simulators 
(12:449). 

Special purpose hardware architectures attempt to exploit the concurrency within
the simulation algorithm. This concurrency results when different events are scheduled
for the same time, which occurs frequently in logic simulation (1:84). Catlin offers two
approaches to parallelizing this inherent concurrency in simulations: data partitioning and
functional partitioning. Data partitioning employs several processors performing identical
functions on different portions of the input data. However, the complex interprocessor
communications and elaborate hardware requirements make this approach unattractive.
Functional partitioning takes advantage of the structure of the simulation algorithm. The
algorithm is broken into portions of approximately equal complexity and disjoint data
structures. Each portion of the algorithm is then assigned to a separate processor for
execution (4:130).

Pure  parallelism  is  not  always  possible.  Often  processes  must  access  the  same  data 
as  in  the  case  of  an  event  list  maintained  in  one  memory.  Concurrency  is  still  possible  with 
multiple  processing  elements.  Each  processing  element  performs  an  individual  task  while 
data  flows  between  them  in  a  pipeline  fashion  (1:84). 

1.4  Constraints

Benefits from speed-up improvements in discrete event simulation can extend not only
to digital logic simulation but also to a variety of applications. The need to speed up digital
logic simulation is obvious; however, this research applies to the broader area of discrete
event simulation in general. The potential for speedup of digital logic simulation may be
limited by focusing on the greater objective of enhancing all discrete event simulations.
This larger domain of discrete event simulation applications may constrain the design of
accelerator hardware. Design options that would enhance aspects unique to digital logic
simulation would necessarily be dismissed in favor of more general applications.

The majority of discrete event simulations are currently executed on the Intel iPSC/2
hypercube. Continued use of this distributed architecture, based on the Intel 80386 CPU,
is anticipated as it represents near state-of-the-art technology and is readily available for
research use.

1.5  Scope 

Implementation  of  a  specific  hardware  simulation  accelerator  was  not  the  goal  of  this 
effort.  The  research  focused  instead  on  a  detailed  requirements  analysis  for  the  design  of 
specific  hardware  enhancements  to  accelerate  discrete  event  computer  simulations  using 
the  Intel  iPSC/2. 

The  detailed  requirements  specification  was  documented  using  VHDL.  Validation 
and  evaluation  of  the  design  and  degree  of  speedup  realized  was  conducted  through  VHDL 
simulations. 

The target architecture is a distributed parallel computer; however, design testing was
performed on a single processor model, representing a single processing node of the iPSC/2
hypercube. The effects of interprocessor communication, processor synchronization, and
load balancing were not measured in this configuration; rather, the accelerator performance,
relative to CPU execution time, was evaluated.

1.6  Standards 

The evaluation of simulation speed is sometimes ambiguous. Simulation performance
is rated using different measurements throughout industry and current literature.
Common measurements include gate evaluations per second, instructions per second, and events
per second. Each measurement provides different information about a simulation’s
performance.

This effort focused on the simulation execution time for a particular class of
simulation modeling: discrete event. The actual run times of a specific discrete event simulation
provided the datum for this effort.

The proposed hardware accelerator design was evaluated with respect to this
standard. Abramovici contends that an order of magnitude speedup is a minimum design goal
(1:83); however, DARPA’s thrust is a speedup of three orders of magnitude over
traditional simulations through the use of parallel processing, supplemented with a dedicated
hardware accelerator.

The  objective  of  this  effort  was  to  determine  how  a  general  discrete  event  simula¬ 
tion  might  be  improved  through  a  hardware  accelerator,  and  to  design  such  a  hardware 
enhancement.  DARPA’s  speedup  goal  is  not  entirely  dependent  upon  this  effort.  Related 
research  in  techniques  to  optimize  simulation  process  distribution  and  minimize  the  ef¬ 
fects  of  load  imbalance,  communications  overhead,  and  synchronization  delay  in  a  parallel 
implementation  will  add  to  the  performance  gains  realized  through  hardware  acceleration. 

1.7  Approach/Methodology 

The  analysis  of  a  general  discrete  event  simulation  model  provided  a  definition  for 
the  problem  space  and  was  used  as  the  foundation  for  the  remainder  of  this  effort.  This 
analysis  clearly  defined  the  portions  of  the  simulation  model  that  exhibit  the  greatest 
potential  for  speedup  through  hardware  enhancements.  Specifically,  those  areas  of  the 
simulation  model  that  required  the  greatest  portion  of  overall  execution  time  and  relative 
frequency  of  execution  were  emphasized  in  the  design  of  a  hardware  accelerator. 

The potential for simulation speedup via the application of special purpose hardware
was evaluated along with the trade-offs associated with a hardware implementation.
The hardware accelerator requirements were specified and implemented using VHDL.

A testbed was devised to evaluate the VHDL accelerator design. A VHDL
behavioral model of the Intel 80386 CPU was not available; hence a complete CPU/accelerator
system evaluation could not be performed. Rather, test vectors representing discrete event
simulation instructions and data, along with CPU interface and control signals, were used
for accelerator design evaluation.

The VHDL design tests were iterative in nature and revealed both design strengths
and shortcomings. The test evaluation and feedback process was instrumental in the
design’s evolution. Portions of the design remain as VHDL behavioral descriptions; however,
the detailed requirements for a discrete event hardware accelerator are completely specified
when considering the design as a whole.

II.  Simulation Acceleration Issues


2.1  Introduction

The  use  of  computer  simulation  to  predict  the  outcome  of  events  or  the  performance 
of  physical  processes  is  not  new.  Computers  have  provided  a  means  to  simulate  a  broad 
range  of  problems  in  fields  as  diverse  as  engineering,  economics,  sociology,  and  weather. 
This  proliferation  of  computer  simulation  has  led  to  models  of  increased  complexity  and 
often  time-consuming  simulation  programs. 

The  Department  of  Defense  is  keenly  aware  of  the  time-consuming  nature  of  complex 
simulations.  Time  delays  to  conduct  simulations  may  adversely  impact  a  commander’s 
ability  to  make  an  informed  decision  or  delay  the  development  of  a  new  system. 

Considering the diverse applications and increased reliance on computer simulations,
the Department of Defense is investigating methods of speeding up the simulation process.
The DoD’s emphasis on Very High Speed Integrated Circuit (VHSIC) technology is one
area that requires significant improvement in simulation speed. This is readily apparent if
one considers that simulating one second of real-time for an application specific integrated
circuit may take days of dedicated processor time (14:42).

This  chapter  is  an  overview  of  different  approaches  available  for  accelerating  computer 
simulation.  Various  simulation  methods  are  described  and  options  for  accelerating  the 
simulation  process  from  a  hardware  perspective  are  presented. 

2.2  Simulation  Techniques 

The two main categories of simulation are continuous (time-driven) and
discrete-event simulation. The time-driven approach is characterized by regular advances, of a
predetermined and fixed increment, of a simulation clock. The values of all simulation
variables are evaluated and updated after each clock advance. If no variables are affected,
the clock simply advances. Event-driven simulations use a clock that advances to the
future time of the next scheduled event. In discrete-event simulation, scheduled events are
repeatedly fetched from a queue and simulated, and only those variables affected by the
event are updated. Each event simulation may spawn new events which are inserted into
the event queue at the appropriate time (22:39).
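
The fetch, simulate, and spawn cycle described above can be sketched in a few lines of Python. This is only an illustrative sketch: the `run` function, the `arrival_handler` rule, and the event names are hypothetical, and the event queue is assumed to be a binary heap ordered by timestamp.

```python
import heapq
import itertools


def run(initial_events, handler, t_end):
    """Repeatedly fetch the earliest event, simulate it, and enqueue
    any new events it spawns. Returns the (time, name) trace."""
    order = itertools.count()  # tie-breaker for events at equal times
    queue = [(t, next(order), name) for t, name in initial_events]
    heapq.heapify(queue)
    trace = []
    while queue:
        t, _, name = heapq.heappop(queue)
        if t > t_end:
            break
        clock = t              # clock jumps to the next scheduled event
        trace.append((clock, name))
        for delay, new_name in handler(clock, name):
            heapq.heappush(queue, (clock + delay, next(order), new_name))
    return trace


def arrival_handler(clock, name):
    """Hypothetical model: each arrival spawns a departure 2 units later."""
    if name == "arrival":
        return [(2, "departure")]
    return []


trace = run([(1, "arrival"), (5, "arrival")], arrival_handler, t_end=100)
```

Note that the clock never visits the intervals between events, and only the handler for the fetched event does any work.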

A  discrete-event  simulation  allows  the  simulator  to  skip  intervals  of  time  where  no 
events  are  scheduled.  The  modeling  of  complex  digital  circuits  is  well  suited  for  discrete- 
event  simulation  since  signal  values  change  at  discrete  times  and  only  a  limited  number  of 
circuit  elements  are  active  at  any  given  time. 

2.3  Distributed  Processing 

Discrete-Event  Simulation  (DES)  programs  often  require  computational  capabilities 
that  exceed  the  fastest  available  machines  (13:81).  Parallel  computer  architectures  have 
the  potential  to  overcome  the  speed  limitations  of  single  processor  computers  and  thus, 
have  received  widespread  attention. 

2.3.1  Taxonomy  for  DES  Architectures  A  notation  similar  to  Flynn’s  for  parallel 
architectures  (e.g.  MIMD,  SIMD  etc.)  can  be  used  to  describe  the  main  architectural 
features  of  DES  machines.  The  basis  for  this  taxonomy  is  the  DES  algorithm  and  its  three 
essential  elements: 

•  Time  control 

•  Event  list  control 

•  Event  (function)  evaluation 

The  implementation  of  these  components  may  vary  between  simulators  but,  in  one 
form  or  another,  they  are  all  present  in  any  DES  (12:450).  The  time  control  compo¬ 
nent  (clock  control)  determines  the  progression  of  simulated  time.  The  event  list  control 
component  schedules  events  in  increasing  time  order  and  the  event  evaluation  component 
processes  the  accessed  events  and  determines  if  new  events  should  be  scheduled. 

The taxonomy has four components: two specify time control characteristics, and
one each for specifying event list control and event evaluation components (see Table 2.1).
The time control mechanisms define the simulation’s classification with respect to time:
unit increment corresponding to continuous time and event-based increment corresponding
to discrete event time. In a multiprocessor system, synchronization may be provided by a
single “global clock” or each processor may maintain a “local clock.”


Table 2.1.  DES Taxonomy Components

1.  TIME CONTROL MECHANISMS
    A.  TIME ADVANCE
        1)  Unit Increment (UI)
        2)  Event-based Increment (EI)
    B.  TIME SYNCHRONIZATION
        1)  Global Clock (GC)
        2)  Local Clock (LC)
2.  EVENT LIST ATTRIBUTES
    1)  Single List (SL)
    2)  Multiple List (ML)
3.  EVENT/FUNCTION EVALUATION
    1)  Single Machine (SM)
    2)  Multiple Machine (MM)


Similarly,  the  event  list  can  be  distributed  and  portions  maintained  by  each  processor 
or  totally  by  a  single  processor.  A  distributed,  or  multiple  event  list,  eliminates  the  delay 
time  to  communicate  the  next  scheduled  event  and  is  potentially  faster  than  the  single 
event  list. 

The last component, event/function evaluation, indicates whether a single processor or
multiple processors are used. Using this taxonomy sixteen possible machine architectures can be
specified by the tuple:

Time  Advance/Time  Synchronization/Event  List  Attributes/Event  Evaluation 

Eight  of  the  sixteen  possible  architectures  are  implemented  with  a  single  machine  (SM)  and 
represent  traditional  sequential  architectures.  The  remaining  eight  are  multiple  machine 
(MM)  architectures  which  include  the  Intel  iPSC/2  hypercube. 
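
The count of sixteen architectures follows directly from the four two-valued components of Table 2.1. A short Python sketch, using the table's abbreviations (the variable names are illustrative only), enumerates the tuples and splits them by evaluation component:

```python
from itertools import product

# Abbreviations taken from Table 2.1.
TIME_ADVANCE = ("UI", "EI")
TIME_SYNCHRONIZATION = ("GC", "LC")
EVENT_LIST = ("SL", "ML")
EVALUATION = ("SM", "MM")

# Each architecture is one choice per component, written as
# Time Advance/Time Synchronization/Event List Attributes/Event Evaluation.
architectures = [
    "/".join(choice)
    for choice in product(TIME_ADVANCE, TIME_SYNCHRONIZATION, EVENT_LIST, EVALUATION)
]

single_machine = [a for a in architectures if a.endswith("/SM")]
multiple_machine = [a for a in architectures if a.endswith("/MM")]

for name in multiple_machine:
    print(name)
```

The multiple machine list includes EI/LC/ML/MM, the classification this chapter assigns to the Intel iPSC/2 hypercube.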

2.3.1.1  Parallel Architectures  The multiple machine architectures represent
parallel processing systems. Speedup is obtained by distributing the simulation workload
among several concurrent processors; however, this is not without some cost. Each parallel
architecture has some limitation that must be considered and its effects evaluated during
system design (12:452).

2.3.1.2  Multiple Machines with Global Clocks - X/GC/Y/MM  These
architectures take advantage of parallel function evaluation to speed up simulation. A single
processor acts as a master and maintains the global clock. The use of a global clock
minimizes time synchronization problems; however, an efficient communications network is
required and the logical processes must be partitioned for effective load balancing.
Parallel event list manipulation is also possible with multiple event lists. This eliminates the
potential bottleneck of a centralized event list, but also requires distribution of event time
information between master and slave processors to ensure global clock updating.

2.3.1.3  Multiple Machines with Local Clocks - X/LC/Y/MM  Potential speedup
in these architectures is obtained through parallel function evaluation, parallel event list
manipulation, and distributed time management. MIMD machines, such as the Intel
iPSC/2 hypercube, are represented in the taxonomy as EI/LC/ML/MM. Here the
simulation is mapped as a set of autonomous communicating processes that exchange time
synchronization and state information through asynchronous message passing (5:198). This
distributed time management allows variable states to be evaluated as each input value
changes.

2.3.2  Distributed  Discrete  Event  Algorithms  In  a  distributed  processing  environ¬ 
ment,  discrete-event  simulations  map  one  or  more  server/queue  pairs  onto  the  active  pro¬ 
cessors  in  the  network.  Each  processor  operates  with  its  own  simulation  clock  and  messages 
are  timestamped  to  reflect  the  simulated  time  at  the  sending  node.  Individual  processors 
may  have  separate  processes  executing  on  them  and  messages  are  routed  between  the 
processor  pairs  by  directed  channels  (22:51). 

Various  distributed  discrete  event  algorithms  have  been  proposed,  but  two  approaches, 
the  Chandy-Misra  algorithm  and  the  Time  Warp  algorithm,  are  most  notable  (28:8).  The 
distinguishing  feature  between  these  algorithms  is  how  they  manage  simulation  time. 

2.3.2.1  Optimistic Paradigm - Time Warp Algorithm  The Time Warp
algorithm relies on general lookahead-rollback as its fundamental synchronization mechanism
(20:404). Each local simulation clock advances independently unless conflicting
information (i.e., a message from the past) arrives, at which point the local simulation clocks
are rolled back to a consistent state, antimessages are sent to override the erroneous
messages, and execution advances along a revised path (28:8).
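
The rollback mechanism can be illustrated with a deliberately simplified Python sketch of a single logical process. Everything here is an assumption made for illustration: the class name, the running-sum state, and the rule that each executed event emits one output message. It is not the Time Warp implementation described in the cited references, only a small model of saving state, detecting a message from the past, undoing later work, and reporting the outputs that antimessages must cancel.

```python
class TimeWarpProcess:
    """Toy logical process: state is a running sum of event payloads."""

    def __init__(self):
        self.lvt = 0          # local virtual time
        self.state = 0
        self.processed = []   # (timestamp, payload, state_before, lvt_before)
        self.sent = []        # (send_time, message) emitted so far

    def handle(self, timestamp, payload):
        """Process one incoming message; return antimessages to send."""
        undone, antimessages = [], []
        if timestamp < self.lvt:            # message from the past
            undone, antimessages = self._rollback(timestamp)
        self._execute(timestamp, payload)
        for ts, pl in undone:               # re-execute along revised path
            self._execute(ts, pl)
        return antimessages

    def _execute(self, timestamp, payload):
        # Save the prior state so this event can be undone later.
        self.processed.append((timestamp, payload, self.state, self.lvt))
        self.state += payload
        self.lvt = timestamp
        self.sent.append((timestamp, ("result", self.state)))

    def _rollback(self, timestamp):
        undone = []
        while self.processed and self.processed[-1][0] > timestamp:
            ts, pl, state_before, lvt_before = self.processed.pop()
            self.state, self.lvt = state_before, lvt_before
            undone.append((ts, pl))
        # Outputs produced after the rollback point are now erroneous.
        antimessages = [m for m in self.sent if m[0] > timestamp]
        self.sent = [m for m in self.sent if m[0] <= timestamp]
        undone.reverse()
        return undone, antimessages


lp = TimeWarpProcess()
lp.handle(10, 1)
lp.handle(30, 2)
cancelled = lp.handle(20, 5)   # straggler: arrives after time 30 was processed
```

In this run the straggler at time 20 forces the event at time 30 to be undone and re-executed, and the output originally sent at time 30 is returned for cancellation.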

The  underlying  principle  of  Time  Warp  is  the  concept  of  “virtual  time.”  Virtual 
time  is  a  temporal  coordinate  system  used  to  measure  progress  and  ensure  synchronization. 
Each  processor  is  updated  with  the  global  virtual  time,  which  only  progresses  forward,  in 
addition  to  its  own  simulation,  or  local  virtual  time.  For  a  given  real  time,  the  global 
virtual  time  represents  the  minimum  of  all  local  virtual  times  and  the  virtual  send  times 
of  all  messages  that  have  yet  to  be  processed  (20:417). 
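
The definition of global virtual time given above reduces to a single minimum. A minimal Python sketch follows; the function name and the sample values are hypothetical.

```python
def global_virtual_time(local_virtual_times, unprocessed_send_times):
    """Minimum over all local virtual times and the virtual send times
    of all messages that have not yet been processed."""
    return min(list(local_virtual_times) + list(unprocessed_send_times))


# Hypothetical snapshot: three processors and two messages still in transit.
gvt = global_virtual_time([40, 55, 32], [28, 60])
```

Here the unprocessed message sent at virtual time 28 holds the global virtual time below every local clock, since that message could still force a rollback to time 28.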

The  primary  overhead  cost  of  Time  Warp  is  associated  with  rollbacks  and  the  commu¬ 
nication  of  antimessages  needed  to  implement  a  rollback  (20:416).  Additionally,  previous 
state  information  must  be  maintained  to  allow  message  cancellation  and  rollback  to  the 
current  global  virtual  time. 

2.3.2.2  Conservative Paradigm - Chandy-Misra Algorithm  The Chandy-Misra
algorithm models a physical system as a distributed network of logical processes
communicating via messages. The event list and global simulation clock, of traditional
sequential simulations, are replaced with an event list and local clock at each logical process.

An  effective  implementation  of  the  Chandy-Misra  algorithm  is  dependent  upon  the 
following  requirements  (5:198-199): 

•  The behavior, at time t, of the physical process being modeled must not be affected
by messages transmitted after t. This is referred to as the realizability condition.

•  Messages between processes must increase monotonically in time (monotonicity
condition).

•  Messages between logical processes must correspond exactly to the sequence of
messages between physical processes (predictability).

The primary difference between the Chandy-Misra algorithm and the Time Warp
technique is the use of “null” messages. Null messages are encoded with a timestamp to
tell the receiving node that no real message will be transmitted before the specified time.
Hence the receiving node may process existing messages without the possibility of reversal
at a later time (28:9).
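
The role of null messages can be sketched in Python for a single receiving process with several input channels. The class and method names are hypothetical, and a payload of `None` is assumed here to mark a null message. The sketch is a simplified reading of the conservative rule: an event is safe to process only when every input channel has promised, by a real or null message, that nothing earlier can still arrive.

```python
from collections import deque


class ConservativeProcess:
    """Toy logical process with one FIFO queue per input channel."""

    def __init__(self, channel_names):
        self.queues = {name: deque() for name in channel_names}
        self.channel_clock = {name: 0 for name in channel_names}
        self.processed = []

    def receive(self, channel, timestamp, payload=None):
        """Accept a message; payload None is treated as a null message."""
        # Timestamps on a channel must not decrease (monotonicity).
        assert timestamp >= self.channel_clock[channel]
        self.channel_clock[channel] = timestamp
        if payload is not None:
            self.queues[channel].append((timestamp, payload))

    def safe_time(self):
        """No future message on any channel can carry an earlier timestamp."""
        return min(self.channel_clock.values())

    def process_safe_events(self):
        horizon = self.safe_time()
        ready = []
        for queue in self.queues.values():
            while queue and queue[0][0] <= horizon:
                ready.append(queue.popleft())
        ready.sort(key=lambda event: event[0])
        self.processed.extend(ready)
        return ready


lp = ConservativeProcess(["a", "b"])
lp.receive("a", 5, "x")
blocked = lp.process_safe_events()    # channel b has promised nothing yet
lp.receive("b", 8)                    # null message: nothing before time 8
released = lp.process_safe_events()
```

Before the null message arrives the process must wait, because channel b could still deliver a message earlier than time 5. The null message removes that possibility, so the event is processed with no risk of later reversal.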

2.4  Discrete-Event  Logic  Simulation 

Digital  logic  circuits  are  simulated  by  modeling  the  circuit  elements  to  determine 
signal  values  for  a  given  sequence  of  input  signals.  The  data  necessary  to  simulate  an 
element  is  referred  to  as  the  element  record.  The  element  record  typically  contains  current 
input  values,  current  output,  one  or  more  delay  time  values,  the  element  type  code,  fan¬ 
out  count  and  destination,  and  a  set  of  exception  flags  (30:4).  The  major  functions  of  a 
discrete-event  logic  simulator  include  element  data  management,  element  evaluation,  event 
management,  and  exception  handling. 

Any  change  in  the  value  of  an  input,  output,  or  state  variable  of  a  given  element  is 
referred  to  as  an  event.  Events  occur  at  discrete  points  in  simulated  time.  An  element 
whose  input  or  state  variable  has  changed  is  evaluated  to  determine  its  new  output  and 
state.  Transitions  of  state  variables  and  generation  of  new  outputs  must  be  scheduled  for 
some  future  time  as  delays  are  usually  associated  with  the  operation  of  elements  (1:83). 
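
A minimal Python sketch of an element record and of event generation with delay is given below. The field names, the gate table, and the `apply_input_event` function are illustrative assumptions rather than the record layout of any particular simulator, and the exception flags mentioned above are omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class ElementRecord:
    """Data needed to simulate one circuit element (illustrative fields)."""
    type_code: str
    inputs: list
    output: int
    delay: int
    fanout: list = field(default_factory=list)  # (destination, input index)


# Hypothetical evaluation routines keyed by element type code.
GATE_FUNCTIONS = {
    "AND": lambda ins: int(all(ins)),
    "OR": lambda ins: int(any(ins)),
    "NOT": lambda ins: int(not ins[0]),
}


def apply_input_event(element, input_index, value, now):
    """Apply an input change and evaluate the element.

    Returns the output event (time, new_value) to schedule, or None if
    the output does not change. The stored output is left untouched
    until the scheduled event is later processed.
    """
    element.inputs[input_index] = value
    new_output = GATE_FUNCTIONS[element.type_code](element.inputs)
    if new_output == element.output:
        return None
    return (now + element.delay, new_output)


and_gate = ElementRecord("AND", inputs=[1, 0], output=0, delay=3)
scheduled = apply_input_event(and_gate, input_index=1, value=1, now=10)

or_gate = ElementRecord("OR", inputs=[1, 0], output=1, delay=2)
unchanged = apply_input_event(or_gate, input_index=1, value=1, now=10)
```

The AND gate's input change at time 10 produces an output event scheduled for time 13, while the OR gate's input change produces no event because its output is unaffected.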

Scheduled  events  are  maintained  on  an  event  queue.  A  simulation  time-flow  mecha¬ 
nism  manipulates  the  events  and  ensures  that  they  occur  in  correct  temporal  order  (1:83). 
When  all  events  at  the  current  simulation  time  are  exhausted,  the  time  is  advanced  to  the 
next  time  for  which  events  are  scheduled. 

Manipulation of the event queue ensures the proper time sequencing of evaluations.
Additionally, only those elements scheduled for an event are evaluated at a given simulation
time. This reduction in number of elements evaluated incurs the additional overhead of
manipulating the event queue. Comfort estimates that between 32% and 40% of all
non-input/output computer time may be spent in event queue processing (8:117).

2.5  Speedup Alternatives

Rarely content with current technology and capabilities, computer and software
designers continue to investigate methods of speeding up computer operations. This section
presents an overview of speedup efforts in the area of simulation.

2.5.1  Software  Acceleration  Alternatives  for  accelerating  the  execution  of  logic  sim¬ 
ulations  have  been  proposed.  The  first  approach  often  considered  is  recoding  software  for 
the  most  frequently  occurring  element  routines  and  the  event  queue  manager  (4:130).  This 
approach  improves  efficiency  through  the  use  of  hand-optimized  assembly  language.  Un¬ 
fortunately,  this  approach  seldom  realizes  more  than  a  three-fold  increase  in  speed  (4:130). 
Additionally,  this  implementation  limits  the  transportability  and  maintainability  of  the 
software  (30:2). 

2.5.2  Application Specific Hardware  Another approach is to acquire a faster
machine or to develop hardware exclusively for simulation. This option provides the greatest
performance increase and can reach as much as 100 to 500 times the speed of software
simulations run on a sequential microprocessor (3:27-29). The disadvantage to this approach
is that special hardware is usually difficult to modify in the field and often cannot be used
for anything else (4:130). Direct implementation of the simulation software in hardware is
also feasible but expensive and inflexible (3:21).

Several  design  options  for  special  purpose  hardware  to  speed  up  simulation  are  avail¬ 
able.  Smith  suggests  the  use  of  one  or  more  stages  of  microcoded  hardware  designed 
especially  for  high  performance  simulation  (30:2).  Using  this  approach,  four  processors 
could  form  a  pipeline  with  stages  for  event  queue  management,  evaluation  routines,  and 
signal  change  propagation. 

2.5.3 Functional Partitioning Catlin and Paseman contend that the structure of the simulation algorithm can be exploited through functional partitioning. The simulation algorithm is broken into three pieces of approximately equal complexity. A separate processor is assigned to each portion of the algorithm and its associated data structures. The tasks of queue management, state maintenance, and element evaluation are performed in disjoint processors and therefore operate simultaneously. Communication between the
processors  is  through  low  bandwidth  First  In  First  Out  (FIFO)  channels  and  processing  is 
done  in  a  dataflow  fashion.  A  host  microprocessor  serves  as  the  nucleus  of  the  system  and 
provides  a  user  interface  during  the  simulation  (4:130-132). 

A network of inexpensive but powerful microprocessing elements is viewed by Comfort as the best method of attaining high instruction execution rates at a moderate cost (7:197). Similar to Catlin and Paseman, Comfort also proposes partitioning the simulation into functional processes. The function of event set processing comprises one partition and is assigned a variable number of processors, each having 'next,' 'schedule,' and 'cancel' functions. The remaining partition consists of all other processing associated with the
simulation  and  is  assigned  to  the  host  processor.  The  host  processor  polls  the  event  set 
processors  for  their  event  notice  of  smallest  next  processing  time.  The  host  then  selects 
the  notice  with  the  smallest  (global)  time  and  acts  upon  it  (8:118). 
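
The polling scheme described above can be illustrated with a minimal sketch. It assumes each event set processor keeps its notices in a binary heap and handles cancellation lazily; neither choice comes from Comfort's description, and the names EventSetProcessor and host_select are hypothetical.

```python
import heapq


class EventSetProcessor:
    """One event-set partition offering next, schedule, and cancel (hypothetical)."""

    def __init__(self):
        self._heap = []           # entries are (time, handle, name)
        self._cancelled = set()   # handles cancelled but not yet removed
        self._handle = 0

    def schedule(self, time, name):
        self._handle += 1
        heapq.heappush(self._heap, (time, self._handle, name))
        return self._handle       # handle that can later be passed to cancel()

    def cancel(self, handle):
        self._cancelled.add(handle)

    def _drop_cancelled(self):
        # Lazily discard cancelled notices that have reached the front.
        while self._heap and self._heap[0][1] in self._cancelled:
            _, handle, _ = heapq.heappop(self._heap)
            self._cancelled.discard(handle)

    def peek_time(self):
        self._drop_cancelled()
        return self._heap[0][0] if self._heap else None

    def next(self):
        self._drop_cancelled()
        time, _, name = heapq.heappop(self._heap)
        return time, name


def host_select(processors):
    """Host side: poll every partition, then act on the globally earliest notice."""
    polled = [(p.peek_time(), i) for i, p in enumerate(processors)]
    candidates = [(t, i) for t, i in polled if t is not None]
    if not candidates:
        return None
    _, index = min(candidates)
    return processors[index].next()
```

In this sketch the host performs one poll per partition for every event it processes, which is the communication cost the partitioning trades against parallel queue maintenance.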

2.5.4 Content-Addressable Memories The use of random access memories for data storage and retrieval has inherent drawbacks because of their word-at-a-time, location-addressed implementation (6:51). Addressing by location is inefficient, particularly if data is dynamically unordered during processing.

Content-Addressable  Memories  (CAMs)  are  capable  of  accessing  data  based  on  con¬ 
tent  rather  than  memory  location.  This  ability  permits  data  searches  for  exact  matches 
with  a  specified  key  or  relative  comparisons  for  an  ordered  data  retrieval  (25:725). 

Considerable  speedup  in  processing  time  is  possible  with  content-addressable  mem¬ 
ories.  This  results  from  the  simultaneous  access  of  data  in  parallel  and  the  elimination  of 
the  need  to  store  data  in  sorted  order  (15:509,518). 

2.6 Summary

Simulation  is  an  integral  part  of  decision  making  in  various  disciplines.  The  use  of 
computers  for  simulation  has  increased  dramatically  over  the  past  20  years  and  simulation 
models have become more complex. The increased model complexity necessitates computer enhancements to minimize the time required to run these simulations, particularly in digital logic simulation where simulations may take days to run.

The  dominant  approach  to  enhancing  computer  simulations  is  to  distribute  the  work¬ 
load  among  multiple  processors  working  in  parallel.  Several  options  of  parallelizing  the 
simulation are available to the designer. Processor networks operating in a dataflow fashion are feasible, as are pipelines of multiple stages. In both approaches the simulation algorithm is partitioned among the processors for independent processing.

Every  designer  must  consider  the  cost  of  design  implementation.  An  additional 
consideration  for  the  design  of  a  simulation  accelerator  is  flexibility.  Application  specific 
hardware  is  often  inflexible  and  one  must  consider  the  tradeoffs  between  speedup  potential 
and the opportunity for reuse in other applications.


III. Methodology

3.1 Introduction

The  design  of  a  Discrete  Event  Simulation  (DES)  hardware  accelerator  requires  a 
detailed  analysis  of  a  general  DES  algorithm.  The  objective  of  this  analysis  is  to  identify 
simulation  functions  and  routines  that  are  frequently  invoked  and/or  account  for  a  large 
portion of the overall simulation execution time. Once these functions are identified, the simulation can be accelerated by implementing them in hardware (31:47).

The  methodology  used  to  analyze  a  general  distributed  DES  is  presented  in  this 
chapter.  A  description  of  the  simulation  testbed  and  the  configuration  of  simulation  logical 
processes  is  given. 

The parallel architecture of the Intel iPSC/2 hypercube is described and the different
simulation  topologies  employed  are  presented.  The  methods  used  for  collecting  simulation 
data  and  the  metrics  for  evaluating  the  data  are  presented  along  with  the  results  of  this 
analysis. 

3.2  Discrete  Event  Simulation  Testbed 

The  parallel  Discrete  Event  Simulation  (DES)  environment  for  this  effort  consisted 
of an eight node Intel iPSC/2 hypercube employing the SPECTRUM simulation protocol
interface  designed  by  the  University  of  Virginia.  The  conservative,  Chandy-Misra  null 
message  protocol  was  used  for  parallel  synchronization. 

3.2.1  SPECTRUM  Interface  SPECTRUM  is  a  generic  testbed  designed  for  eval¬ 
uating  parallel  simulation  protocols  (29:865).  Through  the  use  of  user  defined  protocol 
filters,  SPECTRUM  provides  a  transparent  interface  between  the  application  being  mod¬ 
eled  and  the  parallel  processing  architecture  used  to  execute  the  simulation. 

The  application  to  be  simulated  contains  one  or  more  physical  processes  which  are 
modeled through Logical Processes (LPs). Each LP is composed of three separate entities when executed under SPECTRUM. Referring to Figure 3.1, each LP contains an application component, a process manager, and a node manager. Application components are portions of the original application which may be executed
concurrently.  The  process  manager  provides  routines  to  support  typical  simulation  require¬ 
ments  such  as  managing  simulation  time  and  event  queues.  Low  level  system  requirements, 
such  as  message  passing  between  LPs  and  scheduling,  when  multiple  LPs  are  mapped  to 
a  single  processor,  are  provided  by  the  node  manager. 


Figure 3.1. SPECTRUM Testbed Logical Process (29:868)

The  simulation  protocol  is  implemented  with  SPECTRUM  via  a  user  defined  fil¬ 
ter.  The  filter  provides  the  synchronization  functions  necessary  for  effective  simulation 
execution.  The  basic  filter  functions  required  for  discrete  event  simulation  are:  initialize, 
get-next-event, post-event, advance-time, and post-message. All but the last function occur between the application layer and the process manager. Post-message is a message handling function which occurs between the process and node managers (29:868). Hence
the  use  of  filters  provides  the  interface  between  separate  modules  within  each  LP  while 
providing  the  user  easy  access  for  modifying,  or  replacing,  the  synchronization  protocol. 
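
As an illustration of the filter concept only, the sketch below names the five basic filter functions as an abstract interface and records which module boundary each call crosses. The class names, argument lists, and the tracing behavior are hypothetical; they are not the SPECTRUM source, and no synchronization is performed.

```python
from abc import ABC, abstractmethod


class ProtocolFilter(ABC):
    """Hypothetical interface naming the five basic filter functions."""

    @abstractmethod
    def initialize(self): ...

    @abstractmethod
    def get_next_event(self): ...

    @abstractmethod
    def post_event(self, dest_lp, time, payload): ...

    @abstractmethod
    def advance_time(self, time): ...

    @abstractmethod
    def post_message(self, dest_lp, time, payload): ...


class TracingFilter(ProtocolFilter):
    """Records the boundary each call crosses; a real filter would synchronize here."""

    APP_PM = "application <-> process manager"
    PM_NM = "process manager <-> node manager"

    def __init__(self):
        self.trace = []

    def initialize(self):
        self.trace.append(("initialize", self.APP_PM))

    def get_next_event(self):
        self.trace.append(("get_next_event", self.APP_PM))

    def post_event(self, dest_lp, time, payload):
        self.trace.append(("post_event", self.APP_PM))

    def advance_time(self, time):
        self.trace.append(("advance_time", self.APP_PM))

    def post_message(self, dest_lp, time, payload):
        # The only function located between the process and node managers.
        self.trace.append(("post_message", self.PM_NM))
```

Replacing the synchronization protocol then amounts to supplying a different subclass, which is the kind of flexibility the filter mechanism is intended to provide.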
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3.2.2  Simulation  Application  The  application  used  for  analysis  and  modeling  a 
general Discrete Event Simulation is a simple car wash. The physical process of washing cars is modeled by three types of logical processes: a source, which generates customers for the system; a wash, where the customer service is simulated; and an exit, where the customers depart from the system.

Parallelism  is  achieved  through  multiple  instances  of  source  and  wash  LPs.  Figure  3.2 
shows  the  configuration  of  LPs  for  the  car  wash  simulation.  Although  different  configura¬ 
tions  are  possible  (i.e.,  more  exits  or  fewer  washes),  extensive  revisions  of  the  application 
source code would be necessary. Since the multiple instances of sources and washes are mutually independent (i.e., no data dependencies) in this configuration, concurrent execution of these LPs is possible.


Figure  3.2.  Car  Wash  Simulation,  Logical  Processes 


The  interconnecting  arcs,  which  provide  message  passing  channels  between  LPs,  are 
established  during  initialization  and  remain  fixed  throughout  the  simulation.  The  simula¬ 
tion  is  deterministic  in  that  customer  arrival  rates  are  constant,  although  customers  are 
generated  at  different  frequencies  at  each  source.  Likewise,  the  service  rate  for  a  given 
wash  is  fixed;  however,  this  rate  also  varies  between  individual  wash  LPs.  The  routing  of 
customers from source to exit follows the interconnecting arcs and is also deterministic, with the path taken at forks being a function of the customer, or car, number.

3.2.3 Parallel Processing Architecture An eight node Intel iPSC/2 hypercube provided the parallel processing architecture for executing a general DES. Using Franklin's taxonomy from Section 2.3.1.3, this architecture can be classified as EI/LC/ML/MM, since the DES is event driven, uses local clocks for simulation time, multiple lists for scheduling next events, and multiple processors in a hypercube configuration.

The  basic  architecture  of  each  cube  node  is  a  self-contained  computer  with  a  CPU, 
local  memory  for  programs  and  data,  and  an  input/output  (I/O)  subsystem.  The  distin¬ 
guishing  feature  of  the  iPSC/2  is  the  set  of  bidirectional  I/O  channels  linking  each  node 
to  its  n  immediate  neighbors  in  the  hypercube. 

The  number  of  immediate  neighbors,  n,  also  represents  the  dimension  of  the  hy¬ 
percube.  With  n  =  3,  a  three  dimensional  graph  representation  of  the  iPSC/2  is  shown 
in  Figure  3.3.  This  figure  depicts  an  eight  node  configuration  of  the  hypercube  and  the 
nearest  neighbor  interconnections. 


Figure  3.3.  8  Node  Hypercube  Configuration  (17:1830) 
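
The nearest neighbor relationship can be made concrete with a short sketch. It assumes the conventional labeling in which nodes are numbered 0 to 2^n - 1 and two nodes are neighbors when their binary labels differ in exactly one bit; the function names are illustrative only.

```python
def neighbors(node, dim):
    """Return the nodes whose binary label differs from `node` in exactly one bit."""
    return [node ^ (1 << bit) for bit in range(dim)]


def hops(src, dst):
    """Minimum number of links between two nodes: the count of differing label bits."""
    return bin(src ^ dst).count("1")


if __name__ == "__main__":
    # Eight node cube (n = 3): every node has three immediate neighbors.
    for node in range(8):
        print(node, neighbors(node, 3))
```

Under this labeling no two nodes of the eight node cube are more than three links apart.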


The  distributed  memory  architecture  of  the  hypercube  necessitates  message  passing 
between  nodes  when  information  must  be  shared.  The  bidirectional  I/O  channels,  linking 
nearest neighbors, play a central role in the hypercube's performance. The iPSC/2 uses Direct-Connect Modules (DCM) which provide the necessary routing logic in the hypercube interconnect topology for this purpose.

Earlier  versions  of  the  hypercube  used  a  store-and-forward  communication  scheme 
requiring  approximately  1  ms  to  pass  messages  between  adjacent  nodes  (17:1832).  The 
DCM of the iPSC/2 uses a 16 bit routing probe to encode node address information. This
allows  the  sending  node  to  establish  an  end-to-end  link  with  the  receiving  node  by  routing 
through  intermediate  nodes  along  the  path  (10:448-452). 

3.3  Simulation  Configuration 

Two simulation parameters, along with the mapping of LPs to processing nodes, were varied during the analysis. The first parameter was feedback: its effect on the simulation's execution was investigated by either routing customers from the exit back to the source for resubmission or allowing a straight exit.

Additionally,  an  artificial  workload,  referred  to  as  a  spin  loop,  was  implemented  by 
inserting  varying  size  loops  with  floating  point  operations  within  each  LP.  This  variable 
workload  provided  a  more  realistic  and  general  simulation  for  analysis  as  compared  to  the 
strictly  deterministic  carwash.  The  effects  on  function  execution  frequency  and  overall 
function  execution  time  were  analyzed  by  varying  the  computational  intensity  between 
LPs. 

Figure  3.2  shows  the  carwash  LP  configuration  which  was  fixed  for  all  simulation 
runs. Representing the physical process with eight LPs provided a direct one-to-one mapping of LPs to the eight processing nodes on the iPSC/2 hypercube. To investigate the
effect  of  multiple  processes  executing  on  each  node,  the  mapping  shown  in  Figure  3.4  was 
used.  Similar  to  the  original  mapping,  the  basic  LP  configuration  of  the  simulation  is 
unaltered;  however,  only  four  computing  nodes  of  the  hypercube,  each  having  two  pro¬ 
cesses,  are  employed.  Although  many  mapping  options  of  the  eight  LPs  are  possible,  this 
mapping  was  chosen  since  it  consolidates  communication  paths  and  minimizes  off-node 
communication (21:4-2).


Figure 3.4. Carwash Configured with Two LPs per Node


3.4  Data  Collection  and  Analysis 

A  direct  means  for  parallel  algorithm  analysis  was  not  available  during  this  effort, 
therefore  the  DES  algorithm  had  to  be  instrumented  for  data  collection. 

3.4.1  Algorithm  Instrumentation  Section  3.2.1  described  the  levels  and  modularity 
of the SPECTRUM testbed. Each level of SPECTRUM has a corresponding level of software in the DES algorithm. The carwash application level (afitwash.c) has direct visibility to the process manager (lp_man.c) for event list and time management functions. The process manager in turn has visibility to the synchronization protocol filter (myfilters.c) and the node level message passing functions (cube2.c).

The  analysis  of  a  general  DES  required  instrumenting  the  functions  of  the  process 
manager, the protocol filter, and the node level routines. Figures 3.5-3.7 show the function hierarchy of each level of the DES algorithm.

The  algorithm  was  instrumented  to  gather  function  execution  data.  The  relative 
function execution frequency of each logical process was calculated at each level (i.e., filter, process manager, and node manager) of the DES algorithm. Therefore data was collected from each LP separately and then categorized based on algorithm level.

Figure 3.5. Function Hierarchy of Process Manager Level

Figure 3.6. Function Hierarchy of Filter Level

Figure 3.7. Function Hierarchy of Node Level

The  instrumentation  was  implemented  by  encapsulating  each  function  with  variables 
to  monitor  function  count  and  execution  time.  The  function  execution  count  was  incre¬ 
mented  during  each  successive  function  call,  while  the  cumulative  function  execution  time 
was  updated  with  the  difference  of  each  function’s  start  and  end  times.  Variable  updat¬ 
ing  was  concurrent  with  simulation  execution;  however,  the  final  data  totals  were  reported 
only  after  the  simulation  had  completed,  thus  avoiding  the  significant  overhead  of  updating 
data  files  during  simulation  execution. 
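
The instrumentation approach can be sketched as follows. This is an illustration in Python rather than the instrumented testbed source; the decorator, the STATS dictionary, and the report function are hypothetical names, and time.perf_counter stands in for the hypercube's system clock.

```python
import time
from functools import wraps

# function name -> [call count, cumulative execution time in seconds]
STATS = {}


def instrumented(func):
    """Wrap a function so each call updates its count and cumulative run time."""

    @wraps(func)
    def wrapper(*args, **kwargs):
        entry = STATS.setdefault(func.__name__, [0, 0.0])
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Counters are updated in memory while the simulation runs.
            entry[0] += 1
            entry[1] += time.perf_counter() - start

    return wrapper


def report():
    """Called once after the simulation completes, so no file I/O occurs during the run."""
    return {name: (count, total) for name, (count, total) in STATS.items()}
```

Because a wrapped function's timer keeps running while any subfunction executes, the cumulative time of a caller includes the time of the functions it calls.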

3.4.2 Data Analysis Metrics The function hierarchies of Figures 3.5-3.7 show the
interrelation  between  software  levels  and  functions  in  the  SPECTRUM  testbed  described 
in  Section  3.2.1.  The  complexity  of  determining  the  portion  of  overall  simulation  execution 
time  for  each  level  is  compounded  by  the  inter-level  dependencies  and  function  calls.  To 
simplify  the  analysis,  each  level  of  the  simulation  software  was  considered  separately  with 
the knowledge that subfunction calls are an integral part of the algorithm which contribute to the calling function's total execution time. A relative comparison, with respect to
the  total  number  of  function  executions  and  total  execution  time,  of  each  level’s  primary 
functions  (i.e.  top  tier  of  the  function  hierarchy)  was  then  made  for  each  level. 

By  averaging  the  data  of  each  level’s  primary  functions  across  all  simulation  LPs,  a 
general  view  of  the  individual  function  performance  was  obtained.  This  average  execution 
data  reveals  the  primary  functions  and  portions  of  the  algorithm  that  are  the  most  time 
consuming,  and  thus  have  the  greatest  need  for  acceleration. 
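
A minimal sketch of this metric is given below. It assumes that each LP's primary function times are first normalized by that LP's total for the level and that the resulting fractions are then averaged across LPs; the exact order of normalization and averaging is an assumption, and the numbers in the usage example are illustrative, not measured data.

```python
def relative_times(times):
    """Convert one LP's cumulative times for a level into fractions of the level total."""
    total = sum(times.values())
    return {name: t / total for name, t in times.items()}


def mean_relative(per_lp_times):
    """Average each primary function's relative time across all LPs."""
    fractions = [relative_times(t) for t in per_lp_times]
    names = fractions[0].keys()
    return {n: sum(f[n] for f in fractions) / len(fractions) for n in names}


if __name__ == "__main__":
    # Illustrative values only (seconds per primary function for two LPs).
    lp_a = {"f": 2.0, "g": 2.0}
    lp_b = {"f": 3.0, "g": 1.0}
    print(mean_relative([lp_a, lp_b]))
```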

An  analysis  of  relative  simulation  execution  times  for  multiple  LPs  per  computing 
node,  as  shown  in  Figure  3.4,  was  also  made.  The  need  for  this  analysis  is  based  on  the 
fact  that  the  modeling  of  many  physical  processes  generates  more  logical  processes  than 
the  number  of  available  processing  nodes.  Modeling  with  large  complex  logical  processes, 
by  combining  smaller  LPs,  is  possible  but  may  not  always  result  in  a  one-to-one  mapping 
between LPs and processors.

The analysis of data for processing nodes with multiple LPs considered the additional overhead for task scheduling and switching. The measurement of switching time for individual LPs is extremely difficult and, in fact, was not possible on the iPSC/2, as the system clock resolution is in milliseconds while the task switching time for the Intel 386 DX is about 17 μs (19:5-335). Task scheduling is an operating system function on the hypercube that is performed in a round-robin fashion. Each LP on a given processing node runs for approximately 50 ms, or until it blocks to send or receive a message (18:2-56).



Figure  3.8.  Task  Swapping  Runtimes 


To  get  an  accurate  representation  of  relative  simulation  execution  times,  the  data 
for  multiple  LPs  per  processing  node  had  to  be  adjusted  to  account  for  the  effects  of  task 
switching.  Figure  3.8  is  an  example  of  two  LPs  whose  execution  times  overlap  as  a  result 
of task switching. The runtime for the second LP (B) occurs while LP (A) is swapped out, so the entire runtime of LP (B) is included within A's total execution time. Likewise, portions of B's execution time can be attributed to LP (A) regaining the CPU; therefore, both LPs' overall runtimes must be adjusted accordingly. This adjustment was possible by tracking not only the execution start and finish times, but also the LP's processing node and process ID number. Once a clear distinction between LPs was made, the function execution times were adjusted by subtracting the execution time for the second LP's function from the first. After adjusting the LP execution runtimes, the average relative
simulation  execution  times,  for  each  level’s  primary  functions,  were  calculated  as  described 
above  for  a  single  LP  per  processing  node. 
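
The adjustment for task switching can be sketched as an interval subtraction. The record layout (start, end, LP identifier) and the handling of partial overlap are assumptions made for illustration; the timestamps in the example are arbitrary millisecond values, not measured data.

```python
def adjusted_runtime(outer, other_calls):
    """Remove from `outer` the time during which a different LP on the node was running.

    outer       -- (start, end, lp_id) for the function call being adjusted
    other_calls -- iterable of (start, end, lp_id) records from the same node
    """
    start, end, lp_id = outer
    overlap = 0.0
    for s, e, other_id in other_calls:
        if other_id != lp_id:
            # Only the part of the other LP's run inside the outer window counts.
            overlap += max(0.0, min(end, e) - max(start, s))
    return (end - start) - overlap


if __name__ == "__main__":
    call_a = (0.0, 100.0, "A")   # LP A's function, swapped out part way through
    call_b = (20.0, 70.0, "B")   # LP B's function runs while A is swapped out
    print(adjusted_runtime(call_a, [call_b]))   # time attributable to A alone
```

The sketch attributes overlapping time to the LP that actually held the CPU, which requires that the records from the two LPs can be told apart by node and process identifier.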

3.5  Logical  Process  Function  Execution 

The  analysis  of  simulation  test  data  for  single  and  multiple  LPs  per  processing  node, 
both with and without feedback, and for various spin loops all yielded similar results. The communications overhead of message passing, necessary to implement the conservative synchronization protocol, accounted for the largest relative portion of simulation execution time.

The data collected during this analysis is tabulated in the following sections. The tables are not exhaustive: only those functions with the largest relative execution times for each level were of primary interest and thus are included. Many of the functions that were omitted required little or no measurable execution time (e.g., lp_terminate, node_terminate, node_trash_event), regardless of the simulation configuration. Although some primary functions have been omitted, the tables typically account for 80 percent, or more, of the overall simulation execution time at each level.

3.5.1 One LP per Processing Node Tables 3.1 through 3.3 show the primary functions with the largest average relative execution times for each level of the DES algorithm used for the carwash simulation. These tables reflect data from simulations executed with one LP per processing node, both without an artificial workload and with equally sized spin loops executed on all LPs (i.e., sources, washes, and exit) in the simulation.

Table  3.1  indicates  that  the  process  manager  level  expends  the  greatest  time,  on 
average, posting events. In the conservative synchronization protocol this entails sending event messages to one or more designated LPs, while null messages with a safe lookahead time for that channel are sent on all remaining output channels.

Table 3.1. Mean Relative Execution Time for Process Manager, 1 LP per node

Primary          No Spin            Spin (1/1/1)       Overall
Function         FDBK    NOFDBK     FDBK    NOFDBK     Mean
lp_post_event    0.371   0.334      0.398   0.379      0.371
read_lp_info     0.343   0.379      0.337   0.333      0.348
lp_post_msg      0.151   0.150      0.137   0.183      0.155

The time spent executing read_lp_info occurs only during simulation initialization. Although it is a significant portion of the overall average relative execution time, its one-time execution makes it an unlikely candidate for a hardware accelerator.

Table  3.2  clearly  shows  that  a  significant  portion  of  the  overall  average  execution  time 
for  the  filter  level  is  dedicated  to  supporting  communication  requirements.  The  overhead 
of the send_null function translates to idle processor time. During this idle time, the processor waits for the two-way exchange of a request to send and an acknowledgement from the receiving node, which is required by the csend subfunction of figure 3.6.


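Table 3.2. Mean Relative Execution Time for Filter, 1 LP per node

Primary Function   Legible value   Remaining entries
send_null          0.400           illegible in source
null_msg_flt       0.253           illegible in source
null_get_flt       0.197           illegible in source

Column headings in the source: No Spin (FDBK, NOFDBK) and Spin (1/1/1) (FDBK, NOFDBK). The column to which the legible values belong cannot be determined.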

The primary functions of the node manager level with the largest average relative execution times are shown in table 3.3. As in the other DES algorithm levels, a majority of the average execution time is directly related to communication overhead. Similar to the send_null function of the filter level, both node_btm (block_til_message) and node_rpm (receive_pending_messages) are implemented with subfunctions (see figure 3.7) that require the processor to wait (i.e., crecv, wait to receive, and cprobe, wait for message on channel (18:2-17,19)).

This waiting for communication is an inherent requirement of the conservative Chandy-Misra synchronization protocol discussed in section 2.3.2.2. This waiting again indicates CPU idle time which, given the opportunity, could be redirected to other simulation requirements.

Table 3.3. Mean Relative Execution Time for Node Manager, 1 LP per node

Primary          No Spin            Spin (1/1/1)       Overall
Function         FDBK    NOFDBK     FDBK    NOFDBK     Mean
node_btm         0.429   0.344      0.454   0.286      0.378
node_rpm         0.334   0.359      0.292   0.372      0.339
node_sm          0.086   0.108      0.101   0.140      0.109

3.5.2 Variable Spin with One LP per Processing Node The logical process mapping of figure 3.2 was used for LPs of varying artificial workloads to demonstrate a broader and more generic class of simulations than the deterministic carwash. The workloads, or spin loops, were implemented arbitrarily but in proportions considered to approximate the expected computational intensity of the particular logical process. The ratios shown in table 3.4 through table 3.6 represent the ratio of computational workloads of the sources, washes, and exit respectively.

The  effects  of  an  increased  workload  on  average  function  execution  time  are  apparent 
at the process manager level shown in table 3.4. The sources and exit continue to generate messages at approximately the same rate; however, the washes, with a considerably larger workload, require more time to process a backlog of incoming messages. Hence, considerable time is spent posting received messages, which includes a linked-list queueing subfunction (see figure 3.5) of time complexity O(n).
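
The linear cost of the queueing subfunction can be seen in a minimal sketch of insertion into a time-ordered, singly linked list. The assumption that the list is kept sorted by timestamp, and the class and method names, are illustrative and are not taken from the testbed source.

```python
class _Node:
    __slots__ = ("time", "payload", "next")

    def __init__(self, time, payload):
        self.time = time
        self.payload = payload
        self.next = None


class EventList:
    """Singly linked list kept in ascending timestamp order (hypothetical)."""

    def __init__(self):
        self.head = None

    def insert(self, time, payload):
        """Insert a message; return the number of links walked to find its place."""
        new = _Node(time, payload)
        steps = 0
        if self.head is None or time < self.head.time:
            new.next = self.head
            self.head = new
            return steps
        cur = self.head
        # Walk until the next node is later than the new message: O(n) worst case.
        while cur.next is not None and cur.next.time <= time:
            cur = cur.next
            steps += 1
        new.next = cur.next
        cur.next = new
        return steps

    def times(self):
        out, cur = [], self.head
        while cur is not None:
            out.append(cur.time)
            cur = cur.next
        return out
```

With a backlog of n queued messages, a message stamped later than all of them walks the whole list, so the cost of each post grows with the backlog at the wash LPs.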

The  time  dedicated  to  posting  events  is  significant  but  relatively  constant.  Since  the 
workloads  of  the  sources  and  exit  are  unchanged  they  continue  to  post  events  at  nearly  the 
same  rate  as  without  an  artificial  workload  (see  table  3.1). 

Table 3.4. Mean Relative Execution Time for Process Manager, 1 LP Variable Spin

Primary          Spin (1/5/1)       Spin (1/10/1)      Spin (1/20/1)      Overall
Function         FDBK    NOFDBK     FDBK    NOFDBK     FDBK    NOFDBK     Mean
lp_post_event    0.340   0.328      0.332   0.315      0.335   0.316      0.328
lp_post_msg      0.184   0.201      0.198   0.210      0.210   0.221      0.204
read_lp_info     0.174   0.169      0.138   0.140      0.107   0.107      0.139

The relative execution times of the synchronization protocol are shown in the filter level of table 3.5. The addition of spin loops had little effect on relative execution times of this level's primary functions. The communication overhead of sending messages again accounts for a significant portion of the average execution time. The slight decline in send_null execution time, relative to LPs with no spin loops (see table 3.2), is the result of additional computational activity at the wash LPs, hence fewer output messages.

Table 3.5. Mean Relative Execution Time for Filter, 1 LP Variable Spin

Primary          Spin (1/5/1)       Spin (1/10/1)      Spin (1/20/1)      Overall
Function         FDBK    NOFDBK     FDBK    NOFDBK     FDBK    NOFDBK     Mean
null_msg_flt     0.303   0.302      0.362   0.379      0.456   0.467      0.378
send_null        0.337   0.324      0.256   0.234      0.201   0.192      0.257
null_get_flt     0.215   0.234      0.214   0.237      0.205   0.201      0.218

As expected, the node manager level (table 3.6) received the greatest impact from the increase in computational workload. The floating point calculations used in the spin loops clearly accounted for the largest share of average function execution time at this level.


Table 3.6. Mean Relative Execution Time for Node Manager, 1 LP Variable Spin

Primary          Spin (1/5/1)       Spin (1/10/1)      Spin (1/20/1)      Overall
Function         FDBK    NOFDBK     FDBK    NOFDBK     FDBK    NOFDBK     Mean
node_spin        0.371   0.425      0.443   0.500      0.493   0.549      0.464
node_rpm         0.232   0.253      0.216   0.236      0.205   0.228      0.228
node_btm         0.249   0.135      0.193   0.097      0.154   0.058      0.148

Table  3.6  also  shows  that  communication  overhead  accounts  for  significant  execution 
time at the node level, in spite of the added computational workload. The bottleneck at the wash LPs requires the node_rpm function to receive and post the influx of messages while computations in the spin loop are executed.

3.5.3  Multiple  LPs  per  Processing  Node  Tables  3.7  through  3.9  show  data  from 
simulations  executed  with  two  LPs  per  processing  node.  Data  both  with  and  without  an 
artificial  workload,  and  also  for  the  wash  LPs  performing  ten  times  the  computational 
intensity  of  the  sources  and  exit  is  shown. 

At the process manager level (see table 3.7) the lp_adv_time function accounted for the greatest average relative execution time. Although not a communications related function, each time advance requires a memory access to update the LP's local simulation time. The use of sequential random access memory and the frequency of time updates contribute to the average execution time for this function.

Posting  events,  or  sending  output  messages,  requires  over  20  percent  of  the  average 
execution  time  at  the  process  manager  level.  This  portion  of  execution  time  is  dedicated  to 
updating  output  LPs  with  new  events  and  time  information  but  does  little,  unless  feedback 
is  involved,  to  advance  the  local  LP’s  simulation  state. 

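Table 3.7. Mean Relative Execution Time for Process Manager, 2 LPs per Node

Primary Function   Legible value   Remaining entries
lp_adv_time        0.662           illegible in source
lp_post_event      0.250           illegible in source

The spin-configuration column headings are illegible in the source; the FDBK, NOFDBK, and Overall Mean headings survive, but the column to which the legible values belong cannot be determined.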

From table 3.8, approximately 60 percent of the protocol filter's average execution time is spent sending null messages when two LPs execute on each processing node. Null messages are necessary to ensure deadlock avoidance with the conservative protocol (22:57),
and  implementation  on  a  distributed  hypercube  architecture  requires  the  csend  subfunc¬ 
tion,  a  block  and  wait  communication  procedure  (see  figure  3.6). 
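
The role of null messages in the conservative protocol can be sketched as follows. The sketch assumes a fixed lookahead per LP and represents messages as simple tuples; the function names and message layout are hypothetical and do not reproduce the send_null or csend routines.

```python
def post_event(now, lookahead, out_channels, targets, event_time, payload):
    """Build the messages one LP emits when it posts an event.

    Designated channels receive the event; every other output channel receives
    a null message promising that nothing earlier than now + lookahead will follow.
    """
    messages = []
    for channel in out_channels:
        if channel in targets:
            messages.append((channel, "EVENT", event_time, payload))
        else:
            messages.append((channel, "NULL", now + lookahead, None))
    return messages


def safe_time(input_channel_clocks):
    """An LP may safely process events up to the smallest timestamp
    last received on any of its input channels."""
    return min(input_channel_clocks.values())
```

Without the null messages, an LP whose input channel stays silent could never raise its safe time, and a cycle of such LPs would deadlock. The price is one blocking send per idle channel for every posted event.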

Table 3.8. Mean Relative Execution Time for Filter, 2 LPs per Node

Primary          Spin (1/1/1)       Spin (1/10/1)      Overall
Function         FDBK    NOFDBK     FDBK    NOFDBK     Mean
send_null        0.689   0.468      0.587   0.618      0.591
null_get_flt     0.131   0.308      0.187   0.179      0.201
null_msg_flt     0.152   0.141      0.172   0.150      0.154

The communication overhead associated with receiving and sending messages clearly accounts for the majority of average relative execution time at the node manager level. Table 3.9 shows that node_rpm and node_sm together average approximately 80 percent of the node manager level's execution time, regardless of feedback or spin loop size. Both functions support communication and are implemented with subfunctions which block and wait (see figure 3.7). Here, as with other block and wait type functions, better utilization of the associated processor idle time might be possible.

Table 3.9.  Mean Relative Execution Time for Node Manager, 2 LPs per Node

Primary          Spin (1/1/1)       Spin (1/10/1)      Overall
Function         FDBK    NOFDBK     FDBK    NOFDBK     Mean
node_rpm         0.513   0.542      0.510   0.413      0.495
node_sm          0.319   0.274      0.321   0.301      0.304
node_btm         0.168   0.184      0.169   0.184      0.176

3.5.4  Speedup Potential  The data in Tables 3.1 through 3.9 do not indicate the
portion of overall simulation execution time expended in each level of the DES algorithm.
However, the speedup potential for the individual levels was approximated by considering
each level of the DES algorithm independently of the others.

Ideally, the addition of a hardware accelerator will reduce the average execution time
of the most time-consuming primary functions to near zero. Therefore, by representing
each level’s total execution time as 1, the speedup potential for each function, $S_f$, is
approximated by:

\[
S_f = \frac{1}{1 - \mu_f} \tag{3.1}
\]

where $\mu_f$ is the overall mean execution time for the primary function at that level.

Total speedup potential for each level, $S_l$, was approximated by considering the
cumulative effects of reducing the average execution time of all time-consuming primary
functions for each level to zero. Eliminating the average execution time for all $n$ primary
functions at a given level results in the speedup potential shown in Equation 3.2.

\[
S_l = \frac{1}{1 - \sum_{i=1}^{n} \mu_{f_i}} \tag{3.2}
\]


The speedup potential for each DES algorithm level, and the simulation configuration
analyzed, are summarized in Tables 3.10-3.12. Tables 3.10 and 3.11 (i.e., 1 LP per
node) show little potential for significant speedup in any given level of the DES algorithm.
Assuming equal execution times for each level of the DES algorithm, averaging the level
speedup potentials of Equation 3.2 indicates that the potential total speedup is only 4.83
times for one LP per node with no spin and 3.51 times for one LP per node with added spin.

Table 3.10.  Speedup Potential by Algorithm Level (1 LP per node, no spin)

Algorithm   Primary          Mean Time   Speedup          Speedup
Level       Function         Ratio       Potential S_f    Potential S_l
LP_MAN      lp_post_event    0.371       1.590            2.110
            lp_post_msg      0.155       1.183
FILTER      send_null        0.417       1.715            6.623
            null_msg_flt     0.233       1.304
            null_get_flt     0.199       1.248
NODE        node_btm         0.378       1.608            5.747
            node_rpm         0.339       1.513
            node_sm          0.109       1.122
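
As a rough check of Equations 3.1 and 3.2, the following Python sketch recomputes the speedup potentials of Table 3.10 from its mean time ratios. The function names and the dictionary layout are illustrative assumptions and are not part of the SPECTRUM testbed.

```python
# Mean time ratios from Table 3.10 (1 LP per node, no spin).
MEAN_TIME_RATIOS = {
    "LP_MAN": {"lp_post_event": 0.371, "lp_post_msg": 0.155},
    "FILTER": {"send_null": 0.417, "null_msg_flt": 0.233, "null_get_flt": 0.199},
    "NODE": {"node_btm": 0.378, "node_rpm": 0.339, "node_sm": 0.109},
}


def function_speedup(mu_f):
    """Equation 3.1: speedup if one function's time is reduced to zero."""
    return 1.0 / (1.0 - mu_f)


def level_speedup(ratios):
    """Equation 3.2: speedup if all listed functions are reduced to zero."""
    return 1.0 / (1.0 - sum(ratios))


level_speedups = {
    level: level_speedup(funcs.values())
    for level, funcs in MEAN_TIME_RATIOS.items()
}
# Equal-weight average across the three levels, as used in the text.
average_speedup = sum(level_speedups.values()) / len(level_speedups)

for level, s_l in level_speedups.items():
    print(f"{level}: S_l = {s_l:.3f}")
print(f"average = {average_speedup:.2f}")
```

Run as written, the sketch gives level values of 2.110, 6.623, and 5.747, and an equal-weight average of 4.83, which agree with the tabulated figures.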

Considerably  better  speedup  potential  is  exhibited  when  two  LPs  share  a  single  node 
as  shown  in  table  3.12.  Using  the  assumption  of  equal  execution  times  per  algorithm  level 
and  Equation  3.2,  the  total  potential  simulation  speedup,  for  two  LPs  per  processing  node, 
is  23.62  times. 

A more accurate approximation of potential simulation speedup was made by using
the measured average execution times for each level of the DES algorithm. The calculation
of overall speedup potential was made as a weighted average, using each level’s overall mean
portion of simulation execution time and its potential for speedup. The overall mean
portion of simulation execution time for each algorithm level is given for one LP per node
in Tables 3.13 and 3.14, and for two LPs per node in Table 3.15.

Table 3.11.  Speedup Potential by Algorithm Level (1 LP per node, variable spin)

Algorithm   Primary          Mean Time   Speedup          Speedup
Level       Function         Ratio       Potential S_f    Potential S_l
LP_MAN      lp_post_event    0.328       1.488            2.137
            lp_post_msg      0.204       1.256
FILTER      null_msg_flt     0.378       --               6.803
            send_null        0.257       1.346
            null_get_flt     0.218       1.279
NODE        node_rpm         0.228       --               1.603
            node_btm         0.148       --

Table 3.12.  Speedup Potential by Algorithm Level (2 LPs per node)

Algorithm   Primary          Mean Time   Speedup          Speedup
Level       Function         Ratio       Potential S_f    Potential S_l
LP_MAN      lp_adv_time      0.694       3.268            12.346
            lp_post_event    0.225       1.290
FILTER      send_null        0.591       2.445            18.519
            null_get_flt     0.201       1.252
            null_msg_flt     0.154       1.182
NODE        node_rpm         0.495       --               40.000
            node_sm          0.304       1.437
            node_btm         0.176       1.214

Table 3.13.  Ratio of Simulation Execution Time by Algorithm Level (1 LP per Node, no
spin)

Algorithm   No Spin            Overall
Level       FDBK    NOFDBK     Mean (t_l)
LP_MAN      0.265   0.301      0.283
FILTER      0.299   0.311      0.305
NODE        0.436   0.387      0.412

Table 3.14.  Ratio of Simulation Execution Time by Algorithm Level (1 LP per Node,
variable spin)

Algorithm   Spin (1/5/1)       Spin (1/10/1)      Spin (1/20/1)      Overall
Level       FDBK    NOFDBK     FDBK    NOFDBK     FDBK    NOFDBK     Mean (t_l)
LP_MAN      0.323   0.321      0.344   0.357      0.377   0.390      0.352
FILTER      0.274   0.277      0.252   0.246      0.215   0.215      0.247
NODE        0.403   0.402      0.404   0.396      0.409   0.395      0.402

Table 3.15.  Ratio of Simulation Execution Time by Algorithm Level, 2 LPs per Node

Algorithm   No Spin            Spin (1/10/1)      Overall
Level       FDBK    NOFDBK     FDBK    NOFDBK     Mean (t_l)
LP_MAN      0.419   0.287      0.292   0.205      0.301
FILTER      0.274   0.329      0.318   0.207      0.282
NODE        0.306   0.385      0.390   0.588      0.417

The total potential for simulation speedup for each configuration, $S_{pot}$, was then
calculated using the equation:

\[
S_{pot} = \frac{1}{\sum_{l=1}^{3} \dfrac{t_l}{S_l}} \tag{3.3}
\]

where $t_l$ is the overall mean portion of simulation execution time for level $l$
(Tables 3.13 through 3.15) and $S_l$ is that level’s speedup potential.

The resulting potentials for simulation speedup are:

• 1 LP per node, no spin = 3.97

• 1 LP per node, variable spin = 2.13

• 2 LPs per node, combined = 19.99
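
The weighted calculation of Equation 3.3 can be sketched in Python for two of the configurations, using the overall means of Tables 3.13 and 3.15 and the level speedup potentials of Tables 3.10 and 3.12. The function name and argument order are assumptions made for illustration only.

```python
def total_speedup(time_fractions, level_speedups):
    """Equation 3.3: weighted combination of the per-level speedup potentials."""
    return 1.0 / sum(t / s for t, s in zip(time_fractions, level_speedups))


# Order of entries in each list: LP_MAN, FILTER, NODE.
one_lp_no_spin = total_speedup(
    [0.283, 0.305, 0.412],     # overall means, Table 3.13
    [2.110, 6.623, 5.747],     # level speedup potentials, Table 3.10
)
two_lps = total_speedup(
    [0.301, 0.282, 0.417],     # overall means, Table 3.15
    [12.346, 18.519, 40.000],  # level speedup potentials, Table 3.12
)
print(f"{one_lp_no_spin:.2f}  {two_lps:.2f}")
```

With these inputs the sketch prints 3.97 and 19.99, matching the listed values for one LP per node with no spin and for two LPs per node.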

3.6  Summary

The Intel iPSC/2 hypercube used to analyze a general discrete event simulation
showed adequate, yet not optimum, performance. Idle processor time, resulting from
communication overhead during processor message passing, is clearly an area where
improvement is needed.

The potential for simulation speedup was calculated as a function of average execution
time for each level of the simulation algorithm and of the functions, within those
levels, that required the largest portion of that execution time. With one logical process
per node, the potential for speedup is clearly less than desired, ranging from about two
times to nearly four times; it rises to nearly 20 times when two logical processes are
executed per node.

The  greater  potential  for  speedup,  exhibited  with  more  logical  processes  operating  per 
node,  is  proportional  to  the  increase  in  processor  communication  load  and  its  associated  idle 
time.  Varying  the  processor  computation  load  had  little  effect  on  which  functions  required 
the  greatest  portion  of  simulation  execution  time.  The  communications  required  of  the 
message  intensive  conservative  synchronization  protocol  remained  significant  regardless  of 
processor  workload. 

Better utilization of idle processor time, associated with sending and receiving
messages, is clearly a viable approach to accelerating the execution of discrete event
simulations on the Intel iPSC/2 hypercube.

IV.  DES Coprocessor Design


4.1  Introduction 

When considering hardware acceleration options for the execution of discrete event
simulations on a multiprocessor architecture, two approaches are possible. The parallel
architecture may be viewed from a system level and alternatives for improving the efficiency
of the system (i.e., memory bandwidth, interconnection networks, etc.) may be considered.
An alternative approach is to consider the individual processors, which make up the
parallel system, and consider hardware alternatives for improving processor efficiency and
utilization, thereby improving the overall system performance.

The  parallel  architecture  considered  in  this  thesis  was  the  Intel  iPSC/2  hypercube. 
This  second  generation  hypercube  incorporates  several  advances  over  the  earlier  iPSC/1 
which  improve  the  architecture’s  performance  -  primarily  the  more  powerful  80386,  32-bit 
microprocessor,  and  the  enhanced  circuit-switching  internode  communications  provided  by 
the  direct  connect  modules  (17:1831). 

Currently, efforts are underway at Intel to improve the system performance via
communication modules that implement a mesh interconnection network to supplement the
existing hypercube configuration (27). Intel’s research efforts will undoubtedly improve the
overall performance of the parallel architecture for a wide variety of applications. This
thesis, however, focuses on a specific application, Discrete Event Simulation (DES), whose
potential for accelerated execution is improved by focusing on the hardware implementation
of the primary DES algorithm functions.

With the goal of improved performance for DES, the design approach and requirements
for such an application-specific accelerator are described in this chapter. This
design focuses on the individual processor level rather than the larger parallel system, yet
remains within the constraints of the hypercube architecture. Although higher level system
improvements may be possible, the degree of improvement for a specific application, such
as discrete event simulation, would be overshadowed by more general and widely applicable
enhancements.

4.2  Accelerator System Requirements

Bottlenecks  to  the  efficient  execution  of  Discrete  Event  Simulations  on  the  iPSC/2 
hypercube  are  clearly  evident.  Although  the  carwash  simulation  analyzed  in  Chapter  3  is 
but  a  single  model,  the  use  of  artificially  induced  workloads  and  different  logical  process 
mappings  provided  an  indication  of  general  DES  execution  performance  on  the  hypercube 
architecture. 

The conservative synchronization protocol in use requires continual communication
between logical processes, and therefore requires an efficient interconnection network.
Regardless of the communications efficiency, some processor idle time is expected, as logical
processes must wait for messages at various points throughout the simulation.

The interconnection network of the iPSC/2, incorporating the direct connect modules,
is being upgraded by Intel. Additionally, because of corporate proprietary restrictions
(27), the lack of available hardware documentation makes the independent upgrade of this
system virtually impossible without considerable reverse engineering.

4.2.1  Processor Utilization  While the independent improvement of the hypercube’s
communication network is unlikely, minimizing processor idle time during communications
is possible. Increasing processor utilization is a paramount concern, since discrete event
simulations spend as much as 50 percent of the execution time receiving or sending messages
(see Tables 3.8 and 3.9).

Processor idle time, associated with communication overhead, may be reduced by
incorporating a coprocessor to handle communication, thus freeing the processor for
operations that directly support the simulation. A discrete event simulation coprocessor,
providing support for simulation message passing, would enhance nearly every major
simulation task. Initializing the simulation requires the receipt and dissemination of logical
process information. Posting incoming messages requires monitoring, receiving, and
maintaining status on incoming channels. Posting outgoing events involves sending both event
and null messages. The last major function, getting events, while not directly a
communications function, involves events that were previously received and posted via a
communication process. Relieving the processor of these synchronization protocol functions allows
redirection of the processing power to simulation execution and ultimately to accelerated
simulation execution times.

4.2.2  Memory Management  The distributed parallel architecture of the iPSC/2
was studied to investigate the requirements for a DES hardware accelerator. This platform
incorporates technology and an architectural design that limit the enhancement options
available for a DES accelerator. The use of an Intel 80386 microprocessor, with its
integrated segmentation and paging memory management unit (MMU), reduces the likelihood
of accelerating performance through improved hardware for memory management.

The on-chip MMU performs all virtual-to-physical address translations, segmentation,
and paging violation checking. The MMU uses pipelining and parallel execution to
generate physical addresses by storing the translation, segment, and page descriptors on
chip. Additionally, the use of a translation look-aside buffer (TLB) significantly reduces
paging translation time, since translation is otherwise performed through a two-step table
lookup. Although the TLB has a high hit ratio of 98 percent, the paging unit is supplemented
with special purpose hardware capable of performing a page translation in nine clock cycles
(11:18-21). Therefore, as with the hypercube’s interconnection network, the potential to
improve performance via improved memory management is small.

4.3  Design Approach

The  design  of  a  coprocessor  specifically  for  distributed  discrete  event  simulations 
using  the  conservative  synchronization  protocol  has  yet  to  be  documented  in  the  literature. 
Therefore,  this  initial  design  takes  a  rather  abstract,  chip-level,  approach  to  defining  the 
requirements  necessary  to  implement  such  an  application  specific  coprocessor. 

Using a chip-level approach, the coprocessor requirements were defined in terms
of input/output (I/O) response and the algorithm that the chip implements (2:5). The
necessary operations of the Discrete Event Simulation (DES) coprocessor were defined
using a behavioral description of the coprocessor’s functional components. The IEEE
standard, VHSIC Hardware Description Language (VHDL), was used as the design tool
to implement this behavioral description.

4.3.1  Hardware Implementation of DES Algorithm  Based on the significant
communications overhead and subsequent idle processor time associated with the DES
algorithm, virtually every portion of the simulation algorithm exhibits some potential for
acceleration through a hardware implementation. From Figures 3.5 and 3.6 it is clear that
the primary functions at both the logical process manager and the filter levels require
communications interaction through subfunction calls.

The potential for speedup addressed in Section 3.5.4 is realizable if the processor has
actual work pending. Pending jobs could receive processor time in lieu of the processor
remaining idle while the currently scheduled job waits for communications. Hence the
purpose of this design is to implement the DES algorithm in hardware and thus relieve
the processor from the administrative overhead associated with the conservative
synchronization protocol as it exists in the SPECTRUM testbed. The hardware coprocessor will
provide processing for incoming messages, schedule and package outgoing messages,
manage simulation time, monitor communication arc status and the next event list, and provide
event inputs to the processor when requested. The coprocessor will support the simulation
execution for all logical processes on the node, transparent to the processor’s operation
and, if a sufficient workload is available, increase overall processor utilization.

4.3.2  Process  Model  As  a  supplemental,  application-specific  system  for  the  Intel 
80386  microprocessor,  the  DES  coprocessor  was  viewed  as  a  finite  state  machine.  State 
transitions  are  controlled  by  the  processor  when  needed  and  follow  the  process  model  graph 
of  Figure  4.1. 

Primarily an I/O device, the coprocessor responds to control signals from the
processor. Figure 4.1 shows the three necessary coprocessor states: start, cpu_io, and execute.

With  power  applied  and  no  processor  request  pending,  the  coprocessor  remains  in  the 
start,  or  idle  state.  Basic  housekeeping  functions,  which  consist  primarily  of  maintaining 
the  coprocessor’s  local  memory,  are  performed  while  waiting  for  processor  requests. 

Transition to the cpu_io state is controlled by the processor. The coprocessor, which
resides in the processor’s I/O address space, monitors the CPU control signals and the
system data bus for opcodes and operands when activated by the processor.

Figure 4.1.  Process Model Graph

The execute state is entered at the completion of cpu_io. All necessary DES functions
are executed while in this state. Data flow within the coprocessor is controlled
primarily from this state; however, control of parallel input and writing to local Random
Access Memory (RAM) from the cpu_io state are also supported.
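
The three-state process model can be sketched as a small state machine in Python. This is a minimal illustration only: the flag names are hypothetical stand-ins for the control signals, and the return from execute to start on completion is an assumption, since the transitions of Figure 4.1 are described here only in part.

```python
from enum import Enum


class State(Enum):
    START = "start"      # idle: housekeeping while awaiting a request
    CPU_IO = "cpu_io"    # latching opcode and operands from the data bus
    EXECUTE = "execute"  # running the requested DES function


def next_state(state, cpu_request=False, io_done=False, exec_done=False):
    """Return the next coprocessor state for one clock step.

    The three flags are hypothetical stand-ins for the control signals.
    """
    if state is State.START:
        return State.CPU_IO if cpu_request else State.START
    if state is State.CPU_IO:
        return State.EXECUTE if io_done else State.CPU_IO
    # EXECUTE: assumed to return to idle once the function completes.
    return State.START if exec_done else State.EXECUTE


# One request traced through the model.
trace = [State.START]
trace.append(next_state(trace[-1], cpu_request=True))
trace.append(next_state(trace[-1], io_done=True))
trace.append(next_state(trace[-1], exec_done=True))
print([s.value for s in trace])
```

The trace visits start, cpu_io, execute, and start again, which mirrors the request-driven behavior described above.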

The clock and run processes are included for completeness. Coprocessor timing is
referenced to the main processor’s system clock, while the run process is analogous to chip
enable or connection of primary drive voltage. Similar to the Intel 80386 microprocessor,
the DES coprocessor is designed to use an internally generated clock at half the system
clock frequency for instruction executions (19:5-347).

4.3.3  DES Coprocessor Interface  The DES coprocessor relies on the Intel 80386
microprocessor for controlling state transitions and data transfers. Similar to the Intel
80387 DX math coprocessor, this application-specific coprocessor interfaces to the system
bus architecture for all data transfers and is connected directly to the CPU control signals
(19:5-441,447).

Control of the coprocessor relies on two-way communication between the CPU and
the coprocessor. Operating in the processor’s addressable I/O space, the coprocessor is
controlled via two address lines (A15 and A2) and the M/IO line from the processor. The
choice of a specific address location for the coprocessor was arbitrary but falls within the
general requirements specified for the Intel 80386.

The Intel 80386 can address 16K 32-bit ports in its I/O address space, 0H-FFFFH.
Intel has reserved addresses 00F8H-00FFH for use with its math coprocessor (80387),
which is activated by asserting address line A31 high and toggling address line A2 to
distinguish data from opcodes (19:5-308,309). Intel has advised that while I/O addresses
are available, the iPSC/2’s NX operating system reserves additional I/O addresses that are
unique to each system’s configuration, and can only be identified through reference to the
system documentation, which was unavailable at this time (23). Addressing the coprocessor
at the chosen I/O port is therefore similar to the Intel 80387, in that asserting address
line A15 and M/IO activates the coprocessor, while data and opcodes are distinguished
through address line A2.

Additional control lines are required to provide coprocessor state and to send requests
to the CPU. The Intel 80386 monitors a ready input, READY#, to terminate or wait on
bus transactions with a bus slave (19:5-349). The coprocessor supports this requirement
with a ready signal, READYO, which it asserts low to end bus transfer cycles, as needed
by the Intel 80386.
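
The port selection described for A15, M/IO, and A2 can be sketched in Python as a simple decode function. This is an illustrative model rather than the VHDL behavioral description: the boolean `is_io_cycle` is a hypothetical stand-in for the state of the M/IO line, and treating A2 = 0 as an opcode and A2 = 1 as data is an assumption, since the polarity is not stated here.

```python
A2_MASK = 1 << 2
A15_MASK = 1 << 15


def decode_bus_cycle(address, is_io_cycle):
    """Classify a bus cycle as seen by the coprocessor's select logic.

    Returns "opcode", "data", or None when the coprocessor is not selected.
    Treating A2 = 0 as an opcode and A2 = 1 as data is an assumption.
    """
    if not is_io_cycle or not (address & A15_MASK):
        return None
    return "data" if address & A2_MASK else "opcode"


print(decode_bus_cycle(0x8000, True))   # A15 set, A2 clear
print(decode_bus_cycle(0x8004, True))   # A15 set, A2 set
print(decode_bus_cycle(0x00F8, True))   # A15 clear: not selected
```

Under these assumptions, only I/O cycles with A15 set reach the coprocessor, and a single address bit separates the two kinds of transfer.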

An interrupt request line is also required for the coprocessor interface. Since direct
access to the network switching circuitry of the Direct Connect Module (DCM) is not
possible, the coprocessor will rely on the Intel 80386 to pass outgoing messages to the
DCM. When needed, the coprocessor will assert an interrupt request for message output
and, if the request is not masked, the CPU will forward the outgoing message to the DCM.

Similar to the Intel math coprocessor, the DES coprocessor is designed with
additional control lines to indicate an active state or an error condition. When active, the BUSY
signal tells the CPU that the coprocessor is executing an instruction, while the ERROR
signal reflects a coprocessor exception.

4.3.4  DES Coprocessor Functional Components  A block diagram of the DES
coprocessor system is shown in Figure 4.2. The design consists of five functional blocks
needed to implement the discrete event simulation using a conservative synchronization
protocol and parallel I/O ports to interface with the CPU’s system data bus.

As a chip-level design using VHDL behavioral descriptions, detailed hardware
specifications are not given in this design. The design objective is to specify the hardware
characteristics and behavior necessary to implement the simulation algorithm as it
exists in the SPECTRUM testbed, while ensuring compatibility with the Intel 80386 32-bit
architecture used on the iPSC/2 hypercube.

4.3.4.1  Parallel Input/Output  A parallel interface with the system data bus
of the Intel 80386 requires the use of 32-bit wide buffered latches. Using this approach,
the coprocessor can take full advantage of doubleword aligned bus transfers and be
disconnected via a high impedance state when no bus transaction is active.

The parallel I/O port design closely follows that presented by Armstrong in his VHDL
description of the Mark II processor (2:120-123). Referencing Figure 4.2, the parallel input
and output blocks are designed with identical behavioral descriptions and the direction of
data flow is determined by the I/O port mapping and mode selection for each device.

Figure 4.2.  DES Coprocessor System

The respective I/O port is activated by the coprocessor when needed via a device
select line, DS2. Assertion of this line is in response to interface signals between the
coprocessor and the CPU. Parallel input requires the additional interface of a strobe line,
addr_strb#, from the CPU to determine when data is valid on the system bus and thus
may be latched.

Interfacing with slave units is not anticipated or required at this time; therefore, no
provision for an interrupt capability is included in the design of the I/O ports.

4.3.4.2  Random Access Memory  The Random Access Memory (RAM) of Figure 4.2
supports several critical functions in the DES coprocessor design. Its primary purpose
is to hold simulation information relative to each logical process simulated on the
processing node. Additionally, it provides storage space for coprocessor-required assembly
code and swap space, if needed, to support overflow from the Content Addressable Memory
(CAM).

The RAM incorporates several fundamental design decisions. The most obvious
is the use of doubleword (32-bit) alignment for all memory transactions. This allows
a direct interface to the system data bus of the CPU and permits transfer of data with
fewer bus cycles. Additionally, the data format currently used by the iPSC/2 for integers,
which are the predominant data type used in the simulation analyzed, is a four-byte
doubleword. The choice of doubleword alignment also eliminates the need for two address
lines, as the distinction between the four individual bytes that make up the doubleword is
no longer necessary.

Determining the size of RAM required is based on several assumptions relative
to the simulations supported by the DES coprocessor. A portion of RAM is required
by each logical process being simulated on the processor, and the iPSC/2 operating system
allows a maximum of 20 processes per computing node (18:1-37). Therefore the RAM
design incorporates a separate partition to store each LP’s simulation data. Additionally,
an address pointer table with 20 entries is provided to reference the base address for each
LP’s partition.

The RAM design is organized as shown in Figure 4.3. Each logical process simulated
on a given CPU maintains data identifying the other logical processes on which it is
dependent (i.e., must communicate with). Assuming a maximum of ten input and ten output
logical processes are to communicate with each LP on the node, 20 doubleword addresses
are available for each LP on the node. The limit of ten inputs/outputs was chosen based
on the fan-in and fan-out heuristic for similar electronic devices.

Representation of the unique identities for all input/output LPs within the 20
addresses in coprocessor RAM is contingent on the ability to represent each LP_id with only
32 bits. Each LP is identified by its processing node number and its logical process number
on that node, both of which are 32-bit integers on the iPSC/2. The field size needed to
represent this information, however, is significantly less when using an eight node
hypercube, limited to 20 LPs per node. Therefore the RAM design will store LP_ids as a single
doubleword, with the node number occupying the upper two bytes and the process number
contained in the lower two bytes.
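
The LP_id encoding can be illustrated with a short Python sketch that packs the node number into the upper two bytes and the process number into the lower two bytes of a doubleword. The function names are hypothetical and are not taken from the VHDL description.

```python
def pack_lp_id(node, process):
    """Pack a node number and process number into one 32-bit doubleword."""
    if not (0 <= node <= 0xFFFF and 0 <= process <= 0xFFFF):
        raise ValueError("node and process must each fit in 16 bits")
    return (node << 16) | process


def unpack_lp_id(lp_id):
    """Split a packed LP_id back into (node, process)."""
    return (lp_id >> 16) & 0xFFFF, lp_id & 0xFFFF


lp_id = pack_lp_id(7, 19)          # node 7, logical process 19
print(f"{lp_id:08X}")              # 00070013
print(unpack_lp_id(lp_id))         # (7, 19)
```

Sixteen bits per field leaves ample room for an eight node hypercube with at most 20 LPs per node.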

In addition to the above memory requirements, each LP must maintain data that
reflects the simulation state and supports the conservative synchronization protocol. The
local simulation time is unique for each LP and occupies a doubleword within its
coprocessor RAM partition. Additionally, a single doubleword location is provided for each LP
to maintain the inherent delay time for the LP’s simulation process. Storage of the LP’s
delay time is only necessary for deterministic simulations, where the LP’s execution time
is constant and can be stored during initialization.

The number of input and output communication channels is also required to monitor
incoming and outgoing messages. Similar to the LP_ids, the memory representation for the
number of input and output arcs is stored as a single doubleword, with the number of inputs
occupying the upper two bytes and the number of outputs in the lower two bytes.

[Figure 4.3.  RAM organization. Legible label: OUTPUT_ARC_IDs (10) max]

The last entry in each LP’s RAM partition is a status register containing input
message information. As described in Section 2.3.2.2, a message must be received on every
input arc to ensure the next event may be safely executed. The ARCS_IN_STATUS register
reflects received messages with a ‘1’ in the bit position that corresponds to the input LP
and a ‘0’ otherwise.
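
A minimal Python sketch of the status register test follows. It assumes that bit i of the register corresponds to input arc i, which is an illustrative choice; the actual bit assignment is not specified here, and the function names are hypothetical.

```python
def mark_received(status, arc_index):
    """Set the bit for the input arc on which a message arrived."""
    return status | (1 << arc_index)


def all_arcs_received(status, num_inputs):
    """True when every one of the LP's input arcs has a message pending."""
    full_mask = (1 << num_inputs) - 1
    return (status & full_mask) == full_mask


status = 0
for arc in (0, 2):                       # messages arrive on arcs 0 and 2
    status = mark_received(status, arc)
print(all_arcs_received(status, 3))      # False: arc 1 is still empty
status = mark_received(status, 1)
print(all_arcs_received(status, 3))      # True: the next event is safe
```

A single mask comparison is enough to decide whether the next event may be executed, which suits a hardware register test.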

To meet the above memory requirements, the DES coprocessor design requires 4 Kbytes
of dynamic RAM. This total is obtained by considering that a minimum of 2 Kbytes is
needed to store simulation data and that a working buffer of 2 Kbytes is anticipated for
CAM overflow and local code storage.
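
The simulation data requirement can be tallied from the partition contents itemized in the text. The sketch below is a reconstruction under stated assumptions: each pointer table entry is taken to be one doubleword, and each partition is taken to hold exactly the items listed (arc identifiers, local time, delay time, arc counts, and the status register).

```python
BYTES_PER_DOUBLEWORD = 4
MAX_LPS_PER_NODE = 20

# Doublewords per LP partition, as itemized in the text.
ARC_IDS = 20          # up to 10 input and 10 output LP_ids
LOCAL_TIME = 1
DELAY_TIME = 1
ARC_COUNTS = 1        # inputs in the upper half, outputs in the lower half
ARCS_IN_STATUS = 1

partition_dwords = ARC_IDS + LOCAL_TIME + DELAY_TIME + ARC_COUNTS + ARCS_IN_STATUS
pointer_table_dwords = MAX_LPS_PER_NODE  # one base address per partition (assumed)

simulation_bytes = BYTES_PER_DOUBLEWORD * (
    MAX_LPS_PER_NODE * partition_dwords + pointer_table_dwords
)
print(simulation_bytes)  # 2000
```

Under these assumptions the simulation data occupies 2000 bytes, which is consistent with the 2 Kbyte figure.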

4.3.4.3  Content Addressable Memory  A Content Addressable Memory (CAM)
is included in the DES coprocessor design to store and maintain the next event lists for
each logical process executing on the node. Although the effects of next event list
management were not significant in the car wash simulation analyzed, an in-order retrieval from
a list of n stored events has a time complexity of O(n) when a sequential algorithm is
used. The CAM provides the ability to perform searches of memory in parallel and
thus reduce the time complexity to O(1).

As  with  the  RAM  memory,  the  size  of  CAM  required  must  be  justified  in  the  design. 
Unlike  the  RAM  however,  the  CAM  requires  data  storage  in  compliance  with  a  strict 
format  in  order  to  operate  effectively. 

The CAM size determination and searching process are defined in terms of how each
simulation event is stored in the CAM. Figure 4.4 shows an event list entry in the CAM.
Although some encoding of event information will be done by the DES coprocessor, a total
of 78 bits is necessary to completely define each event on the next event list.

The valid bit is used by the CAM to determine the status of each entry (i.e., if
it’s been used). The remaining fields uniquely identify each event. The “TO_LP” field
identifies which LP on the node is to receive that event. This field requires five bits to
uniquely identify which of the 20 possible LPs receives the event. The “FROM_NODE”
and “FROM_LP” fields, which uniquely identify the sending LP, are also encoded by the
DES coprocessor. The “FROM_NODE” is limited to eight possible nodes on the iPSC/2,
hence it requires only 3 bits. Similarly, the “FROM_LP” is one of 20 possible LPs on that
node and is therefore represented with five bits.

The “TIME_TAG” and “MEMR_PTR” are generated by the iPSC/2 and remain
in their original 32-bit format. The “MEMR_PTR” must remain unchanged as the CPU
requires this memory reference to locate the actual event and its associated data structures,
which are maintained in CPU local memory.
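
The field widths given above can be checked with a short software sketch. The Python fragment below is illustrative only and is not the thesis VHDL: the ordering of the fields within an entry and the helper names pack_entry and unpack_entry are assumptions, while the field names and bit widths are taken from the text.

```python
# Field widths from the text; the ordering within the entry is an assumption.
FIELDS = [
    ("VALID", 1),
    ("TO_LP", 5),
    ("FROM_NODE", 3),
    ("FROM_LP", 5),
    ("TIME_TAG", 32),
    ("MEMR_PTR", 32),
]
ENTRY_BITS = sum(width for _, width in FIELDS)  # total width of one CAM entry


def pack_entry(values):
    """Pack a dict of field values into one integer, first field most significant."""
    word = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack_entry(word):
    """Inverse of pack_entry: split an integer back into named fields."""
    values = {}
    for name, width in reversed(FIELDS):
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values
```

Five bits are enough for the 20 possible LPs because 2^5 = 32, and three bits cover the eight iPSC/2 nodes exactly; the six widths sum to the 78 bits quoted for each entry.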


Figure 4.4. CAM Event Field Entries


The size of CAM required for next event storage will vary as a function of the physical
process being simulated and the number of events generated during the simulation. CAM
size for a general purpose DES coprocessor is based on the following assumptions and
median values:

•  median  value:  10  LPs/node 

•  median  value:  5  input  arcs/LP 

•  assume:  10  events/input  pending 

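Given these median values and assuming an average of 10 events are pending on each input
arc, the number of events stored in the CAM is calculated as:

$$(10\ \text{LPs/node}) \times (5\ \text{input arcs/LP}) \times (10\ \text{events/input arc}) = 500\ \text{events/node}$$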

Based  on  the  storage  requirement  of  500  events/node,  a  CAM  of  4Kbytes  minimum  is 
needed,  since  each  event  requires  78  bits. 
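
The sizing arithmetic can be reproduced in a few lines. This Python sketch uses only the median values and the 78-bit entry width stated above; the constant names are chosen here for illustration.

```python
import math

LPS_PER_NODE = 10        # median value from the text
INPUT_ARCS_PER_LP = 5    # median value from the text
EVENTS_PER_ARC = 10      # assumed pending events per input arc
BITS_PER_EVENT = 78      # width of one CAM entry

events_per_node = LPS_PER_NODE * INPUT_ARCS_PER_LP * EVENTS_PER_ARC
total_bits = events_per_node * BITS_PER_EVENT
total_bytes = math.ceil(total_bits / 8)

print(events_per_node, total_bits, total_bytes)  # prints: 500 39000 4875
```

Under these assumptions the raw storage comes to 4,875 bytes, slightly above 4,096 bytes, so the 4 Kbyte figure is perhaps best read as a rounded lower bound rather than an exact capacity.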

4.3.4.4 DES Coprocessor The DES coprocessor’s operation follows the basic
approach envisioned by von Neumann and described by Hayes (16:179-183). A small
single-address instruction set and a minimal number of registers are used in a fetch, decode,
execute, and store sequence that is initiated and controlled by the CPU.

The operation of the DES coprocessor is depicted in Figure 4.5. The coprocessor
remains idle until activated with an instruction from the CPU. The instruction operands
are sent to the coprocessor along with the opcode during an initial fetch cycle, which occurs
in the cpu_io state. Instruction decoding is performed at the start of the execution state.
Instructions requiring additional operands must repeat the fetching operation; however, this
is done from local coprocessor RAM, where unique logical process information (i.e., simulation
time, number of inputs/outputs, etc.) received from the CPU during initialization is stored.

The execution of each instruction requires both combinational and sequential logic
functions  within  the  DES  coprocessor.  In  addition  to  instruction  and  accumulator  registers, 
the  DES  coprocessor  design  incorporates  a  32-bit  flag  register  and  ten  general  purpose 
registers  to  support  these  operations.  The  primary  role  of  the  flag  register  is  to  monitor 
the  DES  coprocessor’s  memory  status.  The  flag  fields  reflect  whether  the  CAM  is  full  or 
not  and  the  number  of  events  temporarily  stored  in  RAM. 

Several  functions  of  an  Arithmetic  Logic  Unit  (ALU)  are  also  performed  by  the 
DES  coprocessor.  A  counter  function  for  incrementing  and  decrementing  register  values  is 
included  to  maintain  the  status  of  input  message  channels  and  to  monitor  the  sending  of 
messages to output channels. Additionally, the ability to mask specific bit fields is provided
so that multiple data items (i.e., LP node and LP number) can be combined into single 32-bit
fields, which limits the storage requirements and reduces the number of bus cycles needed to
transfer information.

4.4 Design Implementation

The DES coprocessor system design is implemented using a VHDL behavioral
description for each of the functional blocks shown in Figure 4.2. Each block in the design
is described using one or more VHDL processes. Functions and procedures that are
performed multiple times within a given process are incorporated within the package body for
the given functional block. Those functions that are global in nature (i.e., required by more
than one functional block) are incorporated in the system package. A complete listing of
the VHDL source code for the DES coprocessor system is included in the appendices.

4.4.1 System Packages The system packages define types, constants, and functions
that are required by all functional blocks of the DES coprocessor design. Multi-valued logic
(i.e., MVL7), which is a standard type included with the Zycad VHDL system, was chosen
to represent signals within the DES design, as this more closely reflects actual signal
values (32:10:17,18).

Figure 4.5. Operation of the DES Coprocessor

Resolution of the DES coprocessor data bus is necessary as there are five possible
drivers for this bus (see Figure 4.2). The bus resolution function is a variation of the
wiredX function provided with Zycad VHDL (32:10:84-90). The DES coprocessor version
differs from the original Zycad function in using function and argument names that reflect
the DES coprocessor’s operation.

The  need  for  type  conversions  is  inherent  with  the  VHDL  design  language.  The 
DWORD  subtype,  which  is  a  resolved  signal,  is  included  to  avoid  unnecessary  type 
conversions  for  signals  connecting  the  functional  blocks.  Several  functions  contained  in  the 
system  package  (see  Appendix  A)  are  provided  to  support  the  type  conversions  needed  by 
the  various  processes. 

4.4.2 DES Coprocessor Behavior The behavioral description for the DES
coprocessor is the largest functional body of the overall system design. It incorporates the
remaining package bodies and relies on the functions they provide to specify the total DES
coprocessor design.

The coprocessor’s behavior (see Appendix B.2) is defined with five separate
processes: run, state, start, cpu_io, and execute. The run process is included as a primary
system driver and is analogous to a chip select or drive voltage being applied to the
coprocessor chip. The inclusion of the run process directly supports the testbench used to
verify the coprocessor design.

The state process provides the mechanism for state transitions as shown in Figure 4.1.
State transitions are possible during both positive and negative clock transitions. The
progression between states is somewhat sequential in that the coprocessor must perform
an I/O state with the CPU prior to entering the execute state. The sequential requirements
of the state transitions are implemented by evaluating the CPU control signal values at
each clock transition.

4.4.2.1 Start Process The start process represents the DES coprocessor’s state
when not executing a CPU instruction. Required maintenance of the coprocessor’s CAM
takes place while in this state. The maintenance activity is based on the simplifying
assumption that events will arrive in time sequence after the CAM overflows, thereby allowing
an ordered storage in the DES coprocessor’s RAM.

The DES flag register is checked for CAM overflow status. If the CAM has
experienced an overflow, the presence and number of overflowed events (i.e., events temporarily
in RAM) are indicated in this register.

The overflowed events temporarily in RAM are stored in a first-in-first-out (FIFO)
queue. As simulation events are executed, space becomes available in the CAM and the
flag register is updated. During the start process, if events are in the RAM and CAM
space is available, the events are taken from the head of the FIFO queue and passed to
the CAM for storage.
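
The overflow handling described above can be sketched in software. The following Python model is a simplified illustration, not the VHDL start process: the class name OverflowModel, its method names, and the dictionary form of the flag register are all hypothetical.

```python
from collections import deque


class OverflowModel:
    """Toy model of the CAM plus the RAM overflow queue (names are hypothetical)."""

    def __init__(self, cam_capacity):
        self.cam_capacity = cam_capacity
        self.cam = []                # events currently held in the CAM
        self.ram_fifo = deque()      # overflowed events, oldest first

    def store(self, event):
        """Route a new event to the CAM, or to RAM when the CAM is full."""
        if len(self.cam) < self.cam_capacity:
            self.cam.append(event)
        else:
            self.ram_fifo.append(event)

    def flags(self):
        """Flag-register view: CAM-full bit and count of events parked in RAM."""
        return {"cam_full": len(self.cam) >= self.cam_capacity,
                "ram_events": len(self.ram_fifo)}

    def start_maintenance(self):
        """Start-state upkeep: move overflowed events into free CAM slots."""
        moved = 0
        while self.ram_fifo and len(self.cam) < self.cam_capacity:
            self.cam.append(self.ram_fifo.popleft())
            moved += 1
        return moved
```

Because the queue is drained from its head, events re-enter the CAM in the order in which they overflowed, which matches the simplifying assumption that overflowed events arrive in time sequence.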

4.4.2.2 CPU_IO Process The cpu_io process handles all incoming bus
transfers from the CPU in an asynchronous cycle regulated by the system clock and the
coprocessor’s assertion of the ready line low. Additionally, this process monitors the status of
CPU control signals to determine if the bus transfer is an instruction (i.e., A2 = ‘0’) or
operand (i.e., A2 = ‘1’).

During the cpu_io process, CPU bus transfers are received from the parallel input port
through a buffer register. Opcodes are routed to the instruction register while operands are
stored sequentially in internal, general purpose, 32-bit registers or the coprocessor RAM,
depending on the opcode to execute.

Additional operands that are unique to the executing LP are required for the
initialize simulation opcode. Using the LP’s process number (i.e., 0-19 from register 1) to
index the RAM partition pointer table, the base address for storing additional operands
in memory is retrieved.

The saving of additional operands, when required, is done during the CPU bus cycle.
The CPU waits for the ready low assertion while the cpu_io process stores operands at a
sequential offset from the LP’s RAM partition pointer.
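
The pointer-table lookup and sequential operand storage can be sketched as follows. This Python fragment is a hedged illustration: the placement of the pointer table in the first 20 RAM words, the partition length PARTITION_WORDS, and the function name store_operands are assumptions made for the example and are not taken from the thesis; only the 1K-by-32-bit RAM and the LP numbers 0 through 19 come from the text.

```python
RAM_WORDS = 1024          # 1K 32-bit locations, as in the RAM model
MAX_LPS = 20              # LP process numbers 0..19

# Hypothetical layout: the pointer table occupies the first MAX_LPS words and
# each LP partition is PARTITION_WORDS long. Neither value is from the thesis.
PARTITION_WORDS = 32
ram = [0] * RAM_WORDS
for lp in range(MAX_LPS):
    ram[lp] = MAX_LPS + lp * PARTITION_WORDS   # base address of LP's partition


def store_operands(lp_number, operands, first_offset=0):
    """Write operands at sequential offsets from the LP's partition base."""
    if not 0 <= lp_number < MAX_LPS:
        raise ValueError("LP number out of range")
    if first_offset + len(operands) > PARTITION_WORDS:
        raise ValueError("operands overflow the partition")
    base = ram[lp_number]                       # index the pointer table
    for offset, word in enumerate(operands, start=first_offset):
        ram[base + offset] = word & 0xFFFFFFFF  # keep each entry to 32 bits
    return base
```

Indexing by LP number keeps the lookup to a single RAM read before the operands are written, which is consistent with the storage being completed within the CPU bus cycle.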

Anticipated hardware timing delays were specified for required register transfers,
memory access and read/write times, and the generation of output control signals to the
CPU. As an asynchronous process, the CPU is required to wait for the coprocessor to
initiate termination of the bus transfer cycle. Hence, the timing delays specified (see
Appendix A) are somewhat arbitrary, while attempting to satisfy the Intel 80386 requirement
of two clock cycles per bus transfer (19:5-358).

At completion of this process the DES coprocessor has all necessary operands, either
in general purpose registers or both RAM and registers, along with the opcode to execute
in the instruction register.

4.4.2.3 Execute Process The execute process is entered immediately after
every cpu_io cycle. The instruction register is decoded for one of four possible
simulation instructions: initialize simulation, post message, get next event, or post an event.
Once decoded, the appropriate procedure is called for execution.

The  execution  of  simulation  functions  follows  the  DES  coprocessor  operation  outlined 
in  Figure  4.5.  VHDL  procedure  calls,  within  the  execute  process,  are  used  to  implement 
each  of  the  required  simulation  functions. 

4.4.2.4 Initialize Simulation Procedure A new simulation is initialized with
the init_sim procedure. This procedure uses essential LP data (i.e., to_lp, lp_delay, number
of inputs and outputs) stored in general purpose registers one through four, and additional
operands (i.e., input and output LP node and process numbers) to initialize the simulation.

The  local  simulation  clock  is  reset  and  the  minimum  safe  time  is  calculated.  The 
CAM  is  cleared  of  previous  entries  as  is  the  corresponding  flag  register. 

Null  messages  identifying  the  sending  LP  and  containing  the  “TO_LP”  address  and 
minimum  safe  time  are  routed  to  the  CPU  for  output  via  an  interrupt  request  procedure. 
Null  messages  are  sent  to  every  output-arc  LP  identified  in  the  local  RAM  partition. 
After all null messages have been sent, the init_sim procedure saves LP essential data
at the appropriate RAM location and a state transition occurs at the next system clock
transition.
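
The null-message step of initialization can be sketched as below. This Python fragment is illustrative only: the function name init_sim_null_messages is hypothetical, and the rule that the minimum safe time equals the local clock plus the LP delay is an assumption, being the usual lookahead rule in conservative null-message protocols; the thesis text does not state the formula. The four message fields follow the null message format described in Chapter V.

```python
def init_sim_null_messages(from_lp, lp_delay, output_arcs, local_clock=0):
    """Build one null message per output arc after resetting the local clock.

    Assumption: minimum safe time = local clock + LP delay (lookahead).
    """
    safe_time = local_clock + lp_delay
    return [
        {"TO_LP": to_lp, "FROM_LP": from_lp, "SAFE_TIME": safe_time, "NULL": True}
        for to_lp in output_arcs
    ]
```

For example, an LP with a delay of 5 and two output arcs would produce two null messages, each promising that no event will be sent before time 5.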

4.4.2.5 Post Message Procedure The post_msg procedure supports the CPU’s
requirement to receive both event and null messages during the discrete event simulation.
The incoming message is received in three or four bus cycles from the CPU, depending
on the type of message. Since null messages have no real event associated with them, no
pointer to the event storage address in CPU memory is associated with this message type.

Simulation data essential to the receiving LP (i.e., number of in/out arcs, arcs-in
status, simulation time, and LP delay) are loaded from the RAM partition into general
purpose registers. The arcs-in status register is then updated to reflect the receipt of an
input message for the from_lp arc.

CAM  full  status  is  checked,  via  the  flag  register,  and  the  received  message  is  routed 
either  to  the  CAM  or  RAM  swap  space  for  saving.  Prior  to  actual  storage  of  the  message, 
some compression of the message fields, as described in Section 4.3.4.3, is performed to
conserve  memory  storage  space  and  minimize  the  number  of  memory  write  cycles. 

4.4.2.6 Get Event Procedure The get_event procedure performs the reverse
operation of the post message procedure. Again the essential data for the executing LP
are loaded from RAM and the arcs-in status register is checked to determine if an event is
ready for the CPU or if a wait for event message must be sent.

If an event is available for execution, the coprocessor retrieves it by asserting a CAM
read along with the to_lp identifier for next event searching. The next event from the CAM
is  then  routed  to  the  CPU  for  execution,  while  the  coprocessor  updates  the  LP’s  arcs-in 
register  and  local  simulation  clock.  Additionally,  the  CAM  returns  an  extra  bit  with  the 
next  scheduled  event  which  indicates  if  additional  events  from  the  same  input  arc  remain 
in  the  CAM  for  later  execution. 

4.4.2.7 Post Event Procedure The post_event procedure is executed after the
CPU has processed an event action and the subsequent output message is ready for
transmission. This procedure builds the event message for the designated to_lp(s) and returns
the message to the CPU for sending via the DCM module. In addition, this procedure
constructs a null message containing the LP’s identifier and the new safe look ahead time,
which is routed via the DCM module to all remaining output arcs.


4.4.2.8 Signal Multiplexing The VHDL behavioral design for the DES
coprocessor requires the use of several processes, often requiring access to the same data (i.e.,
general purpose registers) or status signals (i.e., the coprocessor readyO line). The need
for additional resolution functions was avoided by using a signal multiplexing technique
described by Armstrong (2:88,89). This approach permits multiple processes to drive
signal ‘X’ while the actual value assigned is that which is most recently applied (i.e., not
signalX'quiet).

4.4.3 Parallel I/O Behavior The behavior of the parallel I/O ports is borrowed
from Armstrong (2:120-123), and only slightly modified for this design. The behavior of a
buffered latch is implemented in this design. Control signals from the coprocessor and the
CPU are used to activate combinational logic and ‘D’ flip-flops, providing a bus latching
capability as well as high impedance disconnect from the bus when not enabled.

Output from the parallel I/O ports uses a guarded block construct, which is
dependent on the device select and the clear latch inputs. Unlike Armstrong’s design, the
parallel  I/O  ports  of  the  DES  coprocessor  system  lack  an  interrupt  capability  as  it  is  not 
necessary  in  the  coprocessor  design.  Constants  defining  inherent  hardware  timing  delays 
are  included  with  the  generic  map  (see  Appendix  A)  and  are  based  on  bus  transfer  timing 
requirements  of  the  Intel  80386. 

4.4.4 RAM Memory Behavior The behavior of RAM memory is defined by a
memory model process. Two procedures (do_read and do_write) are called to perform the
basic memory functions. The memory functions are activated with control lines for I/O and
read/write from the DES coprocessor while the memory location is valid on the local
address bus.

The RAM organization is shown in Figure 4.3, and this data structure is implemented
as an array with 1K entries, each of which is a 32-bit array representing a memory location.
The RAM design incorporates a text_io read operation to load the LP base address pointer
table during component initialization. Therefore, when the VHDL design is executed, the
RAM is preloaded for the necessary memory operations.

An  assertion  statement  is  included  as  a  safeguard  with  the  RAM  design  to  indicate 
if  an  address  changes  during  memory  read/write  operations.  Similar  to  other  functional 
blocks  that  drive  the  DES  coprocessor  data  bus,  the  RAM  defaults  to  a  high  impedance 
state  when  no  memory  transactions  are  active. 

Constant  hardware  time  delays  for  memory  access,  reads,  and  writes  are  defined  in 
the  DES  system  package  of  Appendix  A  and  are  established  to  provide  a  timing  basis  for 
functional  block  interaction.  The  timing  delays  were  again  chosen  arbitrarily,  as  actual 
values depend on both the technology employed for the hardware implementation and
whether  or  not  the  RAM  can  be  included  within  the  DES  coprocessor  chip  package. 

4.4.5 CAM Memory Behavior The Content Addressable Memory behavior is also
implemented  with  a  VHDL  process  construct.  No  requirement  for  an  address  bus  was 
anticipated  nor  included  in  the  design,  as  operations  involving  specific  addresses  are  not 
initiated  by  the  DES  coprocessor. 

The CAM event storage fields shown in Figure 4.4 provide a reference for the read
operation. The CAM algorithm to search for the next scheduled event is implemented with
a sequential loop operation. This differs from the hardware implementation, in which the
search will be performed for all memory locations in parallel.

A  read  hit  is  guaranteed  as  the  coprocessor  checks  the  LP’s  arcs-in  status  register 
prior to requesting the next event. The search process is performed on all valid memory
locations  that  match  the  requesting  LP’s  id.  The  next  scheduled  event  is  determined 
through  a  less  than  comparison  of  the  time  field  of  all  valid  events  that  match  the  requesting 
LP’s  id. 

Once located, the next event is parsed into three portions (i.e., to/from identifier, time
tag, and memory pointer) for bus transfer to the DES coprocessor. The most significant
bit of the first transfer is set high (‘1’) if additional events from the same sending LP
remain in the CAM for future scheduling. This information allows the DES coprocessor
to update the executing LP’s arcs-in status register.

After sending the next event to the DES coprocessor, the CAM process clears the
next event’s valid bit, to allow storage of a received event at that location. Similarly, the
DES coprocessor can update the flag register, if necessary, to indicate available storage
space in the CAM.
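
The read operation described above can be sketched as a sequential loop, mirroring the behavioral model rather than the parallel hardware search. The Python below is an illustration under stated assumptions: the function name cam_read_next, the dictionary representation of an entry, and the tie-breaking rule for equal time tags are choices made here, not details from the thesis.

```python
def cam_read_next(cam, to_lp):
    """Return (event, more_from_sender) for the earliest valid event for to_lp.

    cam is a list of dicts with keys VALID, TO_LP, FROM_NODE, FROM_LP,
    TIME_TAG and MEMR_PTR. The loop stands in for the parallel hardware search.
    """
    best = None
    for index, entry in enumerate(cam):
        if entry["VALID"] and entry["TO_LP"] == to_lp:
            # Strict less-than keeps the first entry found on a time tie.
            if best is None or entry["TIME_TAG"] < cam[best]["TIME_TAG"]:
                best = index
    if best is None:
        raise LookupError("no valid event for this LP")
    event = dict(cam[best])          # copy taken before the slot is released
    cam[best]["VALID"] = 0           # free the slot for a later write
    sender = (event["FROM_NODE"], event["FROM_LP"])
    more = any(
        e["VALID"] and e["TO_LP"] == to_lp
        and (e["FROM_NODE"], e["FROM_LP"]) == sender
        for e in cam
    )
    return event, more
```

The returned flag plays the role of the extra bit sent with the first transfer: it tells the caller whether further events from the same sending LP remain, so the arcs-in status can be updated without a second search.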

CAM write operations are performed in a straightforward fashion. The DES
coprocessor writes events to the CAM only when space is available, based on flag register status.
The received event is inserted in the first available storage location as determined by the
CAM’s sequential (i.e., parallel in hardware) search of the event fields’ valid bits.
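
A write under this scheme reduces to finding the first entry whose valid bit is clear. The Python sketch below is illustrative; the function name cam_write and the dictionary entry format are assumptions, and returning None on a full CAM is a convenience of the example, since the coprocessor is described as checking the flag register before writing.

```python
def cam_write(cam, event):
    """Store event in the first entry whose valid bit is clear.

    Returns the slot index, or None when the CAM is full (the coprocessor is
    expected to have checked the flag register first).
    """
    for index, entry in enumerate(cam):
        if not entry["VALID"]:
            cam[index] = dict(event, VALID=1)  # copy the event and mark it valid
            return index
    return None
```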

4.5 Summary

The design of a DES hardware accelerator was based on the requirements of
individual processors rather than the overall parallel system of the Intel iPSC/2 hypercube.
Bottlenecks  to  the  efficient  execution  of  conservative  discrete  event  simulations  were  found 
to  result  in  significant  CPU  idle  time.  Therefore,  the  purpose  of  a  DES  coprocessor  is  to 
free  the  CPU  from  the  overhead  of  executing  discrete  event  simulation  functions  and  allow 
pending  jobs  immediate  access  to  CPU  execution. 

Operation  of  the  DES  coprocessor  is  controlled  by  the  CPU  when  requested.  It 
functions  as  a  finite  state  machine  performing  I/O  with  the  CPU  and  executing  the  basic 
functions  of  simulation  initialization,  posting  events,  getting  the  next  scheduled  event,  and 
posting  output  messages  when  generated. 

The conservative synchronization protocol is implemented by the DES coprocessor,
hence the overhead of simulation time management, next event list maintenance, and
input/output message traffic is transparent to the CPU.

The  requirements  for  the  DES  coprocessor  operation  are  specified  in  VHDL.  This 
behavioral  description  of  the  DES  coprocessor  is  implemented  with  process  constructs 
and  procedures  necessary  to  support  the  conservative  synchronization  protocol  for  general 
discrete  event  simulations. 

The inclusion of timing delays for hardware modeling is somewhat arbitrary. The
timing constraints of the Intel 80386 provided the basis for selecting these timing values.
Actual timing values will, of course, depend on the technology employed for a hardware
implementation and the package size available for the coprocessor functional blocks.

V. DES Coprocessor Design Test


5.1 Introduction

A  high-level  system  description  and  the  required  functional  behavior  of  a  Discrete 
Event  Simulation  coprocessor  were  implemented  using  the  VHSIC  Hardware  Description 
and  Design  Language  (VHDL).  A  complete  source  code  listing  of  this  behavioral  description 
is  included  in  the  appendices. 

Testing modes for the DES coprocessor design were limited since VHDL designs for
the Intel iPSC/2 and the Intel 80386 were not available. Hence, a complete system test of
the hypercube architecture incorporating the DES coprocessor was not feasible, nor was a
comprehensive integration test with the Intel 80386 microprocessor possible.

A VHDL test bench was designed to simulate the interface of the Intel 80386
microprocessor and the DES coprocessor on a single hypercube processing node. Testing
centered on verifying the DES coprocessor’s implementation of the discrete event
simulation algorithm with the conservative synchronization protocol. The Intel 80386
microprocessor was modeled as a system driver addressing the DES coprocessor as an I/O port.
A detailed behavior of the Intel 80386 was not necessary as only interface control signals and
bus transactions were required to verify the DES coprocessor’s operation.

A  secondary  purpose  of  testing  the  DES  coprocessor  design  was  to  analyze  the 
CPU/DES  coprocessor  interface.  The  DES  coprocessor  was  designed  with  an  asynchronous 
interface  to  the  CPU,  hence  a  timing  analysis  of  device  signals  during  data  transfers  was 
performed. 

The following sections describe the test methodology and DES coprocessor test
configurations used to analyze the coprocessor design. Simulation test data that verify the
DES coprocessor’s implementation of the discrete event simulation algorithm are presented.
A summary of the test data, along with logic analyzer traces, is included for analysis of
the CPU/DES coprocessor interface and coprocessor performance evaluation.

5.2 Design Test Methodology

The DES coprocessor design is composed of functional blocks as shown in Figure 4.2.
The behavior of each block was designed to support the implementation of the DES
coprocessor operations shown in Figure 4.5. The design was tested to verify the implementation
of a general discrete event simulation algorithm using the Chandy-Misra conservative
synchronization protocol.

Unit testing of the functional components was performed prior to overall system
testing. The DES coprocessor component required support from all other functional
components in the design, hence system testing encompassed the unit testing for the DES
coprocessor.

Implementation of the basic discrete event simulation functions of initialize,
post-message, get-next-event, advance-time, and post-event was verified during the system
testing. Additionally, DES coprocessor performance, in terms of execution time, and the
CPU/DES coprocessor interface were analyzed during system testing.

The  DES  coprocessor  system  testing  was  conducted  using  the  carwash  simulation 
configurations  outlined  in  Section  3.2.2.  A  VHDL  testbench  (see  Figure  5.1)  was  designed 
to  test  the  DES  coprocessor  operation  on  a  single  hypercube  processing  node.  The  DES 
coprocessor  was  activated  and  exercised  by  a  testbench  driver  representing  an  Intel  80386 
microprocessor.  The  DES  coprocessor  was  addressed  within  the  testbench  as  an  I/O 
port  connected  to  the  microprocessor.  The  detailed  behavior  of  the  microprocessor  was 
not  necessary  as  the  coprocessor’s  implementation  of  a  general  discrete  event  simulation 
algorithm  could  be  verified  by  tracking  the  coprocessor  state,  values  of  internal  variables, 
and  bus  transfers  with  the  CPU. 

Testbench timing was provided by a system clock, similar to the system clock
required for actual hardware operation. The testbench clock was operated at a frequency
half that of the actual Intel 80386 system clock on the iPSC/2. The testbench clock
therefore represented the internal clock frequencies of the Intel 80386 (19:5-347) and the DES
coprocessor.

Figure 5.1. DES Coprocessor Testbench


Once activated, all testbench signals and variables were available for monitoring with
the Zycad VHDL system. By selectively monitoring these variables and signals, the DES
system’s state, internal processes, and interface activity were analyzed. Recorded
simulations of the DES coprocessor testbench execution provided the necessary data to verify
the implementation of the DES algorithm and to analyze the coprocessor’s performance.

5.2.1 CPU Interface The CPU/DES coprocessor interface was tested to verify the
coprocessor’s response to CPU requests. This testing involved tracking the status of both
the CPU and DES coprocessor control lines and data bus activity between the two
components.

Operation of the DES coprocessor and subsequent state transitions were regulated
by the CPU control lines M_IO, WR, ASTR and address lines A15 and A2. The
coprocessor state was determined by a combination of these CPU control lines and, in turn,
feedback was provided to the CPU via coprocessor status lines INTR, READYO, BUSY,
and ERROR.

The DES coprocessor was addressed when the CPU accessed the I/O port addressed
by A15. The type of bus transaction was determined by the control line WR and address
line A2. Writing to the coprocessor occurred when WR = ‘1’, while WR = ‘0’ represented
a coprocessor read. Address line A2 was used by the coprocessor to discern between opcode
instructions (i.e., A2 = ‘0’) and operand data (i.e., A2 = ‘1’) during CPU write cycles.
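
The decoding of bus transactions from these lines can be summarized in a small sketch. The Python function below is illustrative only: the name classify_bus_cycle and the returned labels are hypothetical, and treating A15 = 1 as the selecting level is an assumption, since the text states only that the port is addressed by A15. The WR and A2 encodings follow the text.

```python
def classify_bus_cycle(a15, wr, a2):
    """Classify a CPU bus cycle from the A15, WR and A2 line values (0 or 1).

    Assumption: A15 = 1 selects the coprocessor port; the text does not give
    the active level.
    """
    if a15 != 1:
        return "not_selected"
    if wr == 0:
        return "read"                 # WR = '0' is a coprocessor read
    # WR = '1' is a write; A2 separates opcodes from operand data.
    return "write_opcode" if a2 == 0 else "write_operand"
```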

Bus transfers between the CPU and DES coprocessor were performed with
doubleword aligned data (i.e., 4 bytes or 32 bits). On coprocessor writes the CPU asserted the
address strobe line (ASTR) when bus data was valid (19:5-349). Conversely, the DES
coprocessor asserted its ready line (READYO) to acknowledge the end of both read and write
bus cycles with the CPU.

The asynchronous timing of the CPU/DES coprocessor interface was regulated by
the system clock in conjunction with the address strobe and coprocessor ready line. New
bus cycles were initiated on either positive or negative transitions of the system clock and
ended on a clock transition when the ready line was asserted low.

5.2.2 General DES Algorithm Functions The primary functions of the DES
algorithm were implemented as procedures within the behavioral description of the DES
coprocessor (see Appendix B.2). The DES function to execute is determined by the
opcode instruction issued by the CPU, and function execution results from a procedure call
within the DES coprocessor’s execute process.

The  execution  paths  of  the  DES  operation  flow  chart  shown  in  Figure  4.5  correspond 
to  unique  CPU  opcode  instructions.  Execution  of  the  DES  algorithm  functions,  via  VHDL 
behavioral  procedure  calls,  corresponds  to  these  flow  chart  paths. 

5.2.3 Simulation Initialization Prior to execution each processing node is initialized
with required simulation information. This process is performed by the Sim_Init procedure,
within the execute process, of the DES coprocessor (see Appendix B.2).

Testing  of  this  basic  DES  function  required  verification  that  the  DES  coprocessor 
stored  the  processing  LP’s  configuration  in  Random  Access  Memory  (RAM)  and  relayed 
the  earliest  time  for  subsequent  messages  to  all  output  LPs  via  “null”  messages. 

Simulation configurations with both one and two LPs per processing node, identical
to the carwash simulation, were tested. Since the two LP configuration requires additional
support from the DES coprocessor, and the one LP test is incorporated as a subset of this
configuration, the remainder of this chapter will focus on the testing of the two LPs per
processing node configuration.

The simulation’s LP configuration was stored in predetermined partitions within
the DES coprocessor’s RAM (see Figure 4.3). Access to the designated partition was
determined via a RAM partition pointer table, which was stored prior to coprocessor
activation and was referenced by the LP’s logical number (i.e., 0, 1, 2, ...). The stored
data consisted of essential simulation data and the identities of the interacting input and
output LPs (i.e., input and output arcs) of the simulation.

The first four addresses of the LP’s RAM partition store essential simulation data
for each LP on the node. This data includes the input LP status (i.e., the ARCS_IN_STATUS
register), the LP’s inherent delay time, the total number of input and output arcs, and the
current simulation time of the LP.
After each LP’s RAM partition is configured with simulation data, “null” messages
are sent to each connected output LP. Sending “null” messages requires interaction
with the CPU for access to the interconnection circuitry of the iPSC/2. Hence, the DES
coprocessor must issue an interrupt request (i.e., INTR asserted high) to the CPU in order
to send these messages. Each “null” message comprises four doubleword fields (i.e.,
TO_LP, FROM_LP, SAFE_TIME, and a NULL), which are routed to each output arc by
the CPU.
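The four-field “null” message can be illustrated with a short Python sketch. It is only an illustration, not part of the VHDL design. The packing of an LP identity into one doubleword (node number in the upper 16 bits, LP number in the lower 16 bits) is inferred from the test data in this chapter, where the doubleword "00000000000000110000000000000001" is labeled as node 3, LP 1. The names pack_lp_id, unpack_lp_id, Message, and null_messages are hypothetical.

```python
from typing import List, NamedTuple, Tuple

NULL_POINTER = 0  # a "null" message carries an event memory pointer of 0


def pack_lp_id(node: int, lp: int) -> int:
    """Pack node and LP numbers into one 32-bit doubleword.

    Assumed layout (inferred from the trace data): node in bits 31..16,
    LP number in bits 15..0.
    """
    if not (0 <= node < 1 << 16 and 0 <= lp < 1 << 16):
        raise ValueError("node and lp must each fit in 16 bits")
    return (node << 16) | lp


def unpack_lp_id(word: int) -> Tuple[int, int]:
    """Return (node, lp) from a packed LP identity doubleword."""
    return (word >> 16) & 0xFFFF, word & 0xFFFF


class Message(NamedTuple):
    to_lp: int      # TO_LP field
    from_lp: int    # FROM_LP field
    time: int       # SAFE_TIME for a "null" message
    pointer: int    # NULL field (0) for a "null" message


def null_messages(from_lp: int, output_arcs: List[int], safe_time: int) -> List[Message]:
    """Build one "null" message per output arc of the sending LP."""
    return [Message(arc, from_lp, safe_time, NULL_POINTER) for arc in output_arcs]
```

For example, null_messages(pack_lp_id(0, 0), [pack_lp_id(1, 1)], 4) yields a single message addressed to LP 1 on node 1 with a safe time of 4.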

Verification  of  the  simulation  initialization  function  was  achieved  by  monitoring  the 
RAM  partition  of  each  LP  supported  by  the  DES  coprocessor.  End-to-end  transmission 
of  “null”  messages  was  not  possible;  however,  by  monitoring  the  CPU/DES  coprocessor 
interface,  the  required  “null”  message  transmissions  to  all  output  arcs,  via  CPU  interrupt 
requests  and  subsequent  acknowledgements,  were  verified. 

5.2.4  Post Message  The scheduling of simulation events for an LP in a distributed
processing architecture is done through a message passing scheme. Received messages
are not executed immediately; rather, they are posted in an event list and scheduled for
execution in time-increasing order.

Adding incoming messages to the event list is performed by the Post_Msg procedure.
The next-event list is maintained in a Content Addressable Memory (CAM) which provides
an event retrieval time complexity of O(1).

The message to post consists of four doubleword fields containing the TO_LP, the
FROM_LP, a TIME_TAG, and a memory pointer to the message in the CPU’s primary
memory. Receipt of the message data was verified by monitoring the bus transfer cycles
with the CPU and ensuring the general purpose registers within the DES coprocessor were
subsequently loaded with this data.

Actual posting of an event to the CAM is contingent upon available storage space
within the CAM. The CAM free space status bit of the FLAGS register must be verified
prior to storage of the event in the CAM. Verification of event posting was accomplished
by monitoring the CAM control lines and the DES system bus. The contents of the CAM
were also examined to verify event posting and updating of the event valid bit for each
added event.

After  posting  a  new  event,  the  CAM  provides  the  DES  coprocessor  with  an  update 
of  its  free  space  status.  The  CAM  status  bit  in  the  FLAGS  register  (i.e.,  bit  0)  is  then 
updated  accordingly.  In  addition  to  verifying  the  updated  FLAGS  register,  the  receiving 
LP’s  input  arcs  status  was  checked  to  ensure  it  reflected  an  input  message  from  the  sending 
LP. 

Posting of new events, after the CAM capacity was reached, used the DES RAM
swap space (see Figure 4.3). The DES coprocessor was able to discern this condition by
monitoring the CAM free space status bit in the FLAGS register.
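The free-space check that precedes each posting can be sketched in Python as follows. This is a toy model under stated assumptions, not the coprocessor's VHDL: the class EventStore and its members are hypothetical names, the CAM is modeled as a plain list, and the polarity of the FLAGS status bit is not given in this section, so the sketch simply exposes a boolean cam_free.

```python
class EventStore:
    """Toy model of the CAM next-event list backed by a RAM swap space."""

    def __init__(self, cam_capacity):
        self.cam_capacity = cam_capacity
        self.cam = []    # events currently held in the CAM
        self.swap = []   # overflow events held in the RAM swap space

    @property
    def cam_free(self):
        """Stand-in for the CAM free space status bit (FLAGS bit 0)."""
        return len(self.cam) < self.cam_capacity

    def post(self, event):
        """Check the free-space status first, then store the event."""
        if self.cam_free:
            self.cam.append(event)
            return "cam"
        self.swap.append(event)
        return "swap"
```

With a capacity of two, the first two calls to post() return "cam" and the third returns "swap", mirroring the overflow condition described above.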

5.2.5  Get Next Event  The DES coprocessor’s ability to retrieve the next scheduled
event for CPU processing was evaluated by initiating a request for a specific LP’s next
event from the CPU. In order to execute the Get_Event instruction, the DES coprocessor
must first read the specified LP’s RAM partition and check the arcs-in status to determine
if all input arcs have posted an incoming message.

Depending on the input arcs status, the DES coprocessor either responds with a
wait-for-event message to the CPU, or the next event for the specified LP is retrieved from
the CAM and sent to the CPU. Next event retrieval was verified by monitoring the CAM
control lines and examining the CAM’s content to ensure the earliest valid event for the
requesting LP was retrieved. Additionally, the valid bit for the next event in the CAM
was checked to ensure its status was changed after providing the next event.

The simulation time advance function is also executed when the next event is sent to
the CPU for processing. Advancing the LP’s simulation time was verified by monitoring
the LP’s previous and updated simulation times. The time advance involves the update
of the current simulation time by advancing the simulation clock to the time of the next
event’s scheduled execution.

If the next event to execute was null (i.e., an event memory pointer of 0), the time
advance function still occurs and must be verified; however, with this condition the output
of a null message, containing the new safe time, to all output arcs must also occur.
Additionally, execution of another Get_Event operation must take place to satisfy the
CPU’s outstanding request for the next event.
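The Get_Event behavior described in this subsection can be summarized in a Python sketch. Several details are assumptions made only for illustration: the dictionary layout of the LP record, the get_event and send_null names, the use of a Python list in place of the CAM, and the choice of current time plus LP delay as the new safe time. How the arcs-in status changes after an event is retrieved is not described here, so it is not modeled.

```python
WAIT = "WAIT_FOR_EVENT"


def get_event(lp, events, send_null):
    """Return the next real event for an LP, or WAIT.

    lp        -- dict with keys: time, delay, n_in, arcs_in_status, outputs
    events    -- list of (time_tag, pointer) tuples posted for this LP
    send_null -- callback(to_lp, safe_time) standing in for the CPU interrupt
    """
    all_arcs_posted = (1 << lp["n_in"]) - 1
    while True:
        # Every input arc must have posted a message before any retrieval.
        if lp["arcs_in_status"] != all_arcs_posted or not events:
            return WAIT
        event = min(events)      # earliest time tag first
        events.remove(event)     # stands in for clearing the CAM valid bit
        time_tag, pointer = event
        lp["time"] = time_tag    # simulation time advance
        if pointer != 0:
            return event         # real event: hand it to the CPU
        # Null event: relay the new safe time, then repeat Get_Event.
        for arc in lp["outputs"]:
            send_null(arc, lp["time"] + lp["delay"])
```

The loop reflects the requirement that a null event still advances time and triggers “null” messages, while the CPU's outstanding request is satisfied only by a later real event or a wait response.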

5.2.6  Post Event  The generation of event messages was verified by sending simulation
results for posting from the CPU to the DES coprocessor. The CPU result is
composed of the executing LP’s identity, the identity of the LP to receive the result, and
a memory pointer addressing the event to post in the CPU’s primary memory.

Evaluation of the Post_Event operation required verification that an event message,
with an updated time tag, was sent to the intended receiver, while “null” messages were
sent to all remaining output arcs of the executing LP. Verifying the execution of this
procedure required a check of the LP RAM partition read, for the identities of all output
arcs connected to the executing LP. Additionally, the DES coprocessor’s ability to
distinguish the LP that received the event message from those that received “null”
messages was verified.

The proper time tag for event messages was verified as the sum of the current
simulation time and the executing LP’s inherent delay. Similarly, the “null” message safe time
was verified to be the same value.
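The Post_Event rule can be written out as a brief Python sketch: the time tag is the current simulation time plus the LP's inherent delay, the intended receiver gets the event pointer, and every other output arc gets a “null” message (pointer 0) with the same time value. The function name post_event and the tuple layout are hypothetical.

```python
def post_event(current_time, delay, outputs, receiver, pointer):
    """Return (to_lp, time, pointer) tuples for every output arc."""
    time_tag = current_time + delay          # event time tag and null safe time
    messages = []
    for arc in outputs:
        ptr = pointer if arc == receiver else 0   # 0 marks a "null" message
        messages.append((arc, time_tag, ptr))
    return messages
```

For instance, with a current time of 11, a delay of 4, and three output arcs, the receiver is sent (receiver, 15, pointer) and the other two arcs are each sent a null message with safe time 15.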

5.3  DES  Coprocessor  Design  Testing 

The VHDL behavioral design of the DES coprocessor system was tested as outlined
in the previous section. Each component module of the design was tested separately,
culminating in an overall DES coprocessor system test. Two configurations were used for
system testing. Initially, a single LP per processing node was tested and then the DES
coprocessor system was tested with two LPs sharing the same CPU.

The two LP configuration test encompassed the individual module tests and
incorporated the single LP configuration as an integral part of the test. Because of its broad
scope, the two LP per processing node system test was used to evaluate the VHDL design
and to verify the implementation of the fundamental DES simulation functions.

The two LPs per processing node configuration is reflected in the logical process
mapping used for the carwash simulation of Figure 3.4. The source and wash LPs of node
0 and their interconnections, with feedback from the wash exit, were used to verify the
DES coprocessor design.

Design  verification  was  accomplished  by  analyzing  test  data  provided  by  the  Zycad 
VHDL  system.  Script  outputs,  from  the  testbench  configuration  (see  Figure  5.1),  provided 
signal  values  and  event  times,  while  showing  process  variables  and  state  values  throughout 
the  system  test.  Additionally,  the  Zycad  VHDL  system’s  General  Purpose  Post  processor 
(GPP)  was  used  to  generate  signal  traces  for  selected  system  ports.  The  signal  traces 
provided  a  graphical  representation  of  signal  activity  and  system  variable  values. 

5.3.1  CPU Interface  The control lines and asynchronous bus cycles interfacing the
CPU and DES coprocessor were analyzed during execution of the DES coprocessor
functions. One or more CPU write cycles occurred for each DES coprocessor function.
These write cycles demonstrated the required interactions between the CPU and DES
coprocessor.

The initiation of an opcode transfer bus cycle is shown below. The opcode for
the Sim_Init instruction is encoded in the three most significant bits (i.e., “000”) of the
doubleword output on the data bus. The CPU control lines, WR and M_IO, indicated the
transmission of an opcode to the DES coprocessor (i.e., A2 = ‘0’ and A15 = ‘1’), while the
DES port values reflected the corresponding receive mode for this transfer.
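Extracting the opcode from the three most significant bits of a 32-bit doubleword can be sketched in Python. Only the Sim_Init encoding (“000”) is stated in the text; the function names are hypothetical and no other opcode values are assumed.

```python
SIM_INIT = 0b000  # the only opcode encoding given in the text


def decode_opcode(doubleword):
    """Return the 3-bit opcode held in bits 31..29 of a doubleword."""
    return (doubleword >> 29) & 0b111


def decode_opcode_bits(bit_string):
    """Same decode, starting from a 32-character trace value such as "000...0"."""
    if len(bit_string) != 32:
        raise ValueError("expected a 32-bit value")
    return decode_opcode(int(bit_string, 2))
```

Applied to the all-zero doubleword that appears on the data bus in the trace excerpt, decode_opcode_bits returns SIM_INIT.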


S NS  -- CPU data_bus
M94:  EVENT /DES_SYS_TEST_BENCH/CPU/DATAOUT (value =
      "00000000000000000000000000000000")
      -- CPU control signals
M88:  EVENT /DES_SYS_TEST_BENCH/CPU/A2OUT (value = '0')
M87:  EVENT /DES_SYS_TEST_BENCH/CPU/A15OUT (value = '1')
M85:  EVENT /DES_SYS_TEST_BENCH/CPU/WROUT (value = '1')
M86:  EVENT /DES_SYS_TEST_BENCH/CPU/M_IOOUT (value = '0')
      -- DES Coprocessor port values
M17:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/CMD0 (value = '0')
M16:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/NPS2 (value = '0')
M14:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/WR (value = '1')
M15:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/NPS1 (value = '0')

After one-half clock cycle (i.e., 62.5 ns for 8 MHz), the CPU’s address strobe line,
ASTR, was asserted, indicating valid data on the bus. The coprocessor entered the I/O
state, as shown by monitor M52, and asserted a parallel read at 139 ns. The data bus was
latched, via the parallel I/O port (i.e., RDP = ‘1’), and buffered in the DES coprocessor
at 191 ns. Loading of the instruction register with the opcode was verified at 196 ns. The
bus cycle was then terminated by the DES coprocessor asserting the ready status line,
READYO, low at 253 ns.


62 NS  -- CPU address strobe
M93:  EVENT /DES_SYS_TEST_BENCH/CPU/ASTROUT (value = '1')
M13:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/CLK (value = '1')
M83:  EVENT /DES_SYS_TEST_BENCH/CPU/CLOCK (value = '1')
      -- DES Coprocessor STATE
M52:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/CPU_IO (value = TRUE)
M60:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/IOWAIT (value = '1')

124 NS
M83:  EVENT /DES_SYS_TEST_BENCH/CPU/CLOCK (value = '0')

139 NS  -- DES Coprocessor read
M4:   EVENT /DES_SYS_TEST_BENCH/COPROC/COP/RDP (value = '1')

159 NS  -- DES data_bus
M:    EVENT /DES_SYS_TEST_BENCH/COPROC/COP/DATAIN (value =
      "00000000000000000000000000000000")

186 NS
M83:  EVENT /DES_SYS_TEST_BENCH/CPU/CLOCK (value = '1')

191 NS  -- DES Coprocessor input buffer
M72:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/BUFF_IO (value =
      "00000000000000000000000000000000")

196 NS  -- DES Coprocessor instruction register
M79:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/IR (value =
      "00000000000000000000000000000000")

201 NS
M4:   EVENT /DES_SYS_TEST_BENCH/COPROC/COP/RDP (value = '0')

221 NS
M:    EVENT /DES_SYS_TEST_BENCH/COPROC/COP/DATAIN (value =
      "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ")

248 NS
M83:  EVENT /DES_SYS_TEST_BENCH/CPU/CLOCK (value = '0')
M60:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/IOWAIT (value = '0')
M52:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/CPU_IO (value = FALSE)

253 NS  -- DES ready line ends bus cycle
M20:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/READYO (value = '0')
M90:  EVENT /DES_SYS_TEST_BENCH/CPU/RDYIN (value = '0')


A graphical representation of the DES coprocessor opcode read bus cycle is shown in
Figure 5.2. This representation of the first bus transfer cycle of the simulation is somewhat
misleading in that the DES coprocessor, because of testbench limitations, requires a system
clock transition at simulation startup before entering the I/O state. Hence, the first half
clock cycle for the DES coprocessor is idle (i.e., until READYO = ‘1’) at test startup.

This timing diagram shows a typical CPU to DES coprocessor write cycle. The CPU
control lines accessed the DES coprocessor, while the address strobe, ASTR, indicated valid
data on the system bus. After latching the data, the DES coprocessor signaled the end of
the bus transfer cycle by asserting the ready line, READYO, low.


[Figure 5.2: GPP signal traces for the coprocessor BUFF_IO, IR, READYO, RDP, and CLK
ports and the CPU DATAOUT, ASTROUT, and control lines; waveform image not reproduced.]

Figure 5.2.  Opcode Read Bus Cycle of DES Coprocessor


5.3.2  Simulation Initialization Function  Each simulation logical process on a
processing node was configured with essential data defining the overall simulation
configuration. After sending the opcode instruction, the CPU transferred the required
initialization operands to the DES coprocessor. An excerpt of trace data from the system
test, showing the first LP’s (i.e., node 0 source LP, from Figure 3.4) receipt of the fourth
Sim_Init operand (i.e., number of output LPs), is listed below.

The coprocessor’s general purpose registers were loaded with the operands for the
initialization function (see monitor M68). Register 1 was loaded with the TO_LP (i.e.,
node 0, LP 0), while the deterministic LP delay, the number of input arcs, and the number
of output arcs (i.e., 4, 2, and 3 respectively) were loaded into registers 2 through 4. The
remaining general purpose registers had not yet been used at this point in the simulation;
hence their values were unknown (i.e., ‘X’).


1131 NS  -- DES Coprocessor read line
M4:   EVENT /DES_SYS_TEST_BENCH/COPROC/COP/RDP (value = '1')

1151 NS  -- DES Coprocessor input
M:    EVENT /DES_SYS_TEST_BENCH/COPROC/COP/DATAIN (value =
      "00000000000000000000000000000011")

1178 NS
M83:  EVENT /DES_SYS_TEST_BENCH/CPU/CLOCK (value = '1')

      -- DES Coprocessor input buffer
M72:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/BUFF_IO (value =
      "00000000000000000000000000000011")

      -- Sim_Init operands loaded in Registers 1 to 4
M68:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/REG_32 (value =
      -- TO_LP (i.e., node 0, LP 0)          -- LP delay
      ("00000000000000000000000000000000", "00000000000000000000000000000100",
      -- number of input arcs                -- number of output arcs
       "00000000000000000000000000000010", "00000000000000000000000000000011",
       "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX", "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
       "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX", "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
       "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX", "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"))



Figure 5.3 shows a GPP trace of the general purpose registers loaded with the primary
Sim_Init operands. This figure shows the register loading sequence of buffering the system
data bus and loading the operands sequentially into the general purpose registers. This
LP specific data was used to initialize the simulation; therefore it was maintained in local
registers while executing the Sim_Init procedure and stored in the LP’s RAM partition
after function execution.

The addresses of the input and output LPs for the Sim_Init operation were also
provided by the CPU; however, since a maximum of 20 input/output arcs are possible, these
additional operands were stored in the LP’s RAM partition and read when needed. The
DES coprocessor’s local data bus, the DES coprocessor/RAM interface, and the LP’s RAM
partition were monitored to ensure the proper storage of the input/output arc operands.

Figure 5.3.  DES Coprocessor Registers with Sim_Init Operands


Test data showing the receipt and storage of the LP’s first input arc in the RAM
partition follows. Monitor M72 shows an input to LP 0 on node 0 from itself (i.e., a
doubleword representing LP # and node #). This reflects the operation of a source LP
generating a new car event in the carwash simulation. The operand was temporarily held
in general purpose register 8, while the RAM partition base address was read from index
‘0’ (i.e., LP 0) of the RAM partition pointer table. After the RAM partition base
address (i.e., 544 decimal) was read, it was temporarily held in general purpose register 9
and the RAM write cycle followed.


1431 NS  -- LP's 1st input arc
M72:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/BUFF_IO (value =
      "00000000000000000000000000000000")

1436 NS
M68:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/REG_32 (value =
      ("00000000000000000000000000000000", "00000000000000000000000000000100",
       "00000000000000000000000000000010", "00000000000000000000000000000011",
       "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX", "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
      -- operand in (Register 8)
       "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX", "00000000000000000000000000000000",
       "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX", "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"))

1441 NS
M4:   EVENT /DES_SYS_TEST_BENCH/COPROC/COP/RDP (value = '0')

1451 NS  -- DES ram read signals
M11:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/IO (value = '1')
M12:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/RW (value = '0')

1461 NS
M:    EVENT /DES_SYS_TEST_BENCH/COPROC/COP/DATAIN (value =
      "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ")

1466 NS  -- RAM read address
M2:   EVENT /DES_SYS_TEST_BENCH/COPROC/COP/MA (value = "00000000000")
      -- LP RAM partition base_address
M:    EVENT /DES_SYS_TEST_BENCH/COPROC/COP/DATAIN (value =
      "00000000000000000000001000100000")

1488 NS
M83:  EVENT /DES_SYS_TEST_BENCH/CPU/CLOCK (value = '0')

1501 NS  -- base_address to input buffer
M72:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/BUFF_IO (value =
      "00000000000000000000001000100000")

1506 NS
M68:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/REG_32 (value =
      ("00000000000000000000000000000000", "00000000000000000000000000000100",
       "00000000000000000000000000000010", "00000000000000000000000000000011",
       "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX", "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
       "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX", "00000000000000000000000000000000",
      -- LP's base_address
       "00000000000000000000001000100000", "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"))


The RAM write cycle, storing the first input arc for LP 0, is reflected in the following
excerpt of test data. The received operand was put onto the DES system data bus at 1541 ns,
while the RAM partition address and the RAM read/write control line (RW) were asserted
at 1551 ns and 1566 ns respectively. As designed, the first input operand was stored at an
offset of four (i.e., 548 decimal) from the RAM partition base address, leaving space for the
operands remaining in general purpose registers one through four. The RAM partition
address was then incremented by one and saved in general purpose register six for
additional write cycles.
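The address arithmetic in this write cycle can be checked with a few lines of Python. The sketch assumes only what the text and trace data state: a partition base address of 544, a four-word header ahead of the arc entries, and an 11-bit memory address (MA) bus. The names HEADER_WORDS, arc_address, and to_bus are hypothetical.

```python
HEADER_WORDS = 4   # first four partition addresses hold per-LP simulation data
MA_WIDTH = 11      # width of the MA address values seen in the trace data


def arc_address(base, arc_index):
    """RAM address of the arc entry at arc_index (0 = first stored arc)."""
    return base + HEADER_WORDS + arc_index


def to_bus(address, width=MA_WIDTH):
    """Format an address as the bit string shown on the MA port."""
    return format(address, "0{}b".format(width))
```

With a base of 544, arc_address(544, 0) gives 548, which to_bus renders as "01000100100", the RAM write address seen in the trace; the next entry, 549, is the incremented address saved in register six.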


1516 NS
M11:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/IO (value = '0')
M:    EVENT /DES_SYS_TEST_BENCH/COPROC/COP/DATAIN (value =
      "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ")

1521 NS  -- buffer data for RAM
M72:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/BUFF_IO (value =
      "00000000000000000000000000000000")
M12:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/RW (value = 'Z')

1536 NS  -- DES Coprocessor I/O operation
M11:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/IO (value = '1')

1541 NS  -- DES Coprocessor data_bus
M1:   EVENT /DES_SYS_TEST_BENCH/COPROC/COP/DATAOUT (value =
      "00000000000000000000000000000000")

1550 NS
M13:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/CLK (value = '1')

1551 NS  -- RAM write address
M2:   EVENT /DES_SYS_TEST_BENCH/COPROC/COP/MA (value = "01000100100")

1566 NS  -- RAM write signal
M12:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/RW (value = '1')

1581 NS
M68:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/REG_32 (value =
      ("00000000000000000000000000000000", "00000000000000000000000000000100",
       "00000000000000000000000000000010", "00000000000000000000000000000011",
      -- next RAM partition addr
       "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX", "00000000000000000000001000100101",
       "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX", "00000000000000000000000000000000",
      -- LP's base_addr
       "00000000000000000000001000100000", "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"))

1596 NS
M2:   EVENT /DES_SYS_TEST_BENCH/COPROC/COP/MA (value = "ZZZZZZZZZZZ")
M11:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/IO (value = '0')

Figure 5.4 is a GPP logic diagram of the DES coprocessor write operation described
above. The assertion of the read/write control line (RW) occurs after both the data and the
RAM memory address are stable, allowing the RAM to read and store the input arc identity
for later retrieval. Storage was performed in consecutive RAM partition locations by
incrementing the memory address between successive write operations.


Figure 5.4.  DES Coprocessor Write to RAM Partition


The sending of “null” messages to all output arcs was the final action of the Sim_Init
function. This operation was verified by monitoring the DES coprocessor/CPU interface
to ensure complete messages, with accurate safe times, were routed through the CPU to
each output arc.

A sample of one portion (i.e., sending of the TO_LP field) of a typical “null” message
send is reflected in the following test data excerpt. The DES coprocessor asserted an
interrupt request to send the first of four “null” message fields (i.e., TO_LP, FROM_LP,
Safe_Time, Null) at 3154 ns. The CPU, in turn, serviced the interrupt with a coprocessor
read (i.e., LP 1 on node 1 at 3217 ns), which was terminated by the coprocessor asserting
the READYO line low at 3247 ns. After receiving all required inputs, the CPU was able to
build an output message and route it accordingly, via the interconnection circuitry of the
Intel iPSC/2 hypercube.


3154 NS  -- CPU interrupt request
M3:   EVENT /DES_SYS_TEST_BENCH/COPROC/COP/INTR (value = '1')
M89:  EVENT /DES_SYS_TEST_BENCH/CPU/INTIN (value = '1')

3162 NS
M83:  EVENT /DES_SYS_TEST_BENCH/CPU/CLOCK (value = '1')

3172 NS  -- CPU acknowledge via control signals
M85:  EVENT /DES_SYS_TEST_BENCH/CPU/WROUT (value = '0')
M87:  EVENT /DES_SYS_TEST_BENCH/CPU/A15OUT (value = '1')
M86:  EVENT /DES_SYS_TEST_BENCH/CPU/M_IOOUT (value = '0')
M14:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/WR (value = '0')
M16:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/NPS2 (value = '1')
M15:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/NPS1 (value = '0')

3187 NS  -- DES Coprocessor write
M5:   EVENT /DES_SYS_TEST_BENCH/COPROC/COP/WTP (value = '1')
M3:   EVENT /DES_SYS_TEST_BENCH/COPROC/COP/INTR (value = '0')
M1:   EVENT /DES_SYS_TEST_BENCH/COPROC/COP/DATAOUT (value =
      "00000000000000010000000000000001")  -- TO_LP
M89:  EVENT /DES_SYS_TEST_BENCH/CPU/INTIN (value = '0')

3217 NS
M95:  EVENT /DES_SYS_TEST_BENCH/CPU/DATAIN (value =
      "00000000000000010000000000000001")

3224 NS
M83:  EVENT /DES_SYS_TEST_BENCH/CPU/CLOCK (value = '0')

3247 NS
M5:   EVENT /DES_SYS_TEST_BENCH/COPROC/COP/WTP (value = '0')
      -- DES Coprocessor ends bus cycle
M20:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/READYO (value = '0')
M90:  EVENT /DES_SYS_TEST_BENCH/CPU/RDYIN (value = '0')


5.3.3  Post Message Function  The posting of an incoming message involved all
components of the DES coprocessor design. The parallel I/O ports received the necessary
operands from the CPU, while related LP data, stored during initialization, was held in
the coprocessor’s RAM, and the next event list was maintained in the coprocessor’s CAM.

For testing, several messages, both “null” and real, were posted to each LP’s next
event list in the CAM. The CPU memory pointers used for testing were arbitrarily
chosen to monitor message field changes. Therefore, the memory pointers have no direct
correspondence to actual memory addresses. Loading of the opcode instruction for this
procedure, and all DES coprocessor simulation functions, is identical to that of the Sim_Init
instruction previously described.

Operands for the Post_Msg function included the four fields of the message to post
(i.e., TO_LP, FROM_LP, Time_Tag, and a memory pointer to the event). Therefore, four
CPU bus write cycles were required to load the message fields. The last bus transfer cycle,
containing the CPU memory pointer to the event, is shown below.


9920 NS  -- CPU address strobe
M93:  EVENT /DES_SYS_TEST_BENCH/CPU/ASTROUT (value = '1')
M83:  EVENT /DES_SYS_TEST_BENCH/CPU/CLOCK (value = '0')

9935 NS  -- DES Coprocessor read
M4:   EVENT /DES_SYS_TEST_BENCH/COPROC/COP/RDP (value = '1')

9955 NS  -- DES Coprocessor input
M:    EVENT /DES_SYS_TEST_BENCH/COPROC/COP/DATAIN (value =
      "11101011101011101011101011101011")

9982 NS
M83:  EVENT /DES_SYS_TEST_BENCH/CPU/CLOCK (value = '1')

9987 NS  -- DES Coprocessor input buffer
M72:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/BUFF_IO (value =
      "11101011101011101011101011101011")

9992 NS
M68:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/REG_32 (value =
      -- TO_LP                               -- FROM_LP Node 3 & LP 1 (exit)
      ("00000000000000000000000000000000", "00000000000000110000000000000001",
      -- event Time_Tag                      -- event memory pointer
       "00000000000000000000000000001111", "11101011101011101011101011101011",
      -- current data in remaining registers
       "00000000000000000000000000000000", "00000000000000000000001000111110",
       "00000000000000110000000000000001", "00000000000000110000000000000001",
       "00000000000000000000001000111000", "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"))

9993 NS
M72:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/BUFF_IO (value =
      "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ")

9997 NS
M4:   EVENT /DES_SYS_TEST_BENCH/COPROC/COP/RDP (value = '0')

10017 NS
M:    EVENT /DES_SYS_TEST_BENCH/COPROC/COP/DATAIN (value =
      "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ")

10044 NS
M83:  EVENT /DES_SYS_TEST_BENCH/CPU/CLOCK (value = '0')
M60:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/IOWAIT (value = '0')
M52:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/CPU_IO (value = FALSE)

10049 NS  -- DES Coprocessor ends bus cycle
M20:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/READYO (value = '0')
M90:  EVENT /DES_SYS_TEST_BENCH/CPU/RDYIN (value = '0')
M94:  EVENT /DES_SYS_TEST_BENCH/CPU/DATAOUT (value =
      "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ")

The DES coprocessor read and buffered the fourth field of the message to post
at 9987 ns. After transfer to a general purpose register, the complete message to post was
contained in general purpose registers one through four, as reflected in monitor M68.

Prior to storing the event, the DES coprocessor compared the FROM_LP identity,
in register two, with the input arc identities maintained in the RAM partition, to update
the LP’s arcs-in status. This was accomplished through a sequential read of the input arcs
from the RAM partition, until a match was found.

Reading of input LPs from RAM is shown below. The test data indicated that
the second input arc, stored at RAM address 549 (decimal), was the sending LP for the
received message. Monitor M68, at 10666 ns, reflects the TO_LP’s arcs-in status in general
purpose register six. The status was updated, by setting the second least significant bit,
as a result of the input LP match. Following the arcs-in status update, the new arcs-in
status was saved in the LP’s RAM partition for later use with the Get_Event function.
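The sequential match and status update can be sketched in Python. The bit assignment (the arc at position n sets bit n, counting from zero) follows the observation above that a match on the second input arc set the second least significant bit; the function name update_arcs_in and the use of a list for the stored arc identities are assumptions made for illustration.

```python
def update_arcs_in(status, input_arcs, from_lp):
    """Scan the stored input arcs in order and set the matching status bit.

    status     -- current arcs-in status value
    input_arcs -- arc identities in the order stored in the RAM partition
    from_lp    -- FROM_LP identity of the received message
    """
    for index, arc in enumerate(input_arcs):
        if arc == from_lp:
            return status | (1 << index)
    raise LookupError("FROM_LP is not an input arc of this LP")
```

Using the identities from the test data (first arc LP 0 on node 0, second arc LP 1 on node 3), a message from node 3, LP 1 changes a status of 0 to binary 10, the value later written back to the RAM partition.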


10601 NS  -- DES Coprocessor I/O operation
M11:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/IO (value = '1')

10602 NS
M83:  EVENT /DES_SYS_TEST_BENCH/CPU/CLOCK (value = '1')

10616 NS  -- RAM read address (549)
M2:   EVENT /DES_SYS_TEST_BENCH/COPROC/COP/MA (value = "01000100101")
      -- RAM data equals input arc ID
VMON5: READ /DES_SYS_TEST_BENCH/COPROC/MEM/P/M(549) (value =
      "00000000000000110000000000000001")
M:    EVENT /DES_SYS_TEST_BENCH/COPROC/COP/DATAIN (value =
      "00000000000000110000000000000001")

10651 NS
M72:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/BUFF_IO (value =
      "00000000000000110000000000000001")

10664 NS
M83:  EVENT /DES_SYS_TEST_BENCH/CPU/CLOCK (value = '0')

10666 NS  -- RAM matches input_arc ID
M64:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/MATCH (value = TRUE)
M2:   EVENT /DES_SYS_TEST_BENCH/COPROC/COP/MA (value = "ZZZZZZZZZZZ")
M11:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/IO (value = '0')
M68:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/REG_32 (value =
      ("00000000000000000000000000000000", "00000000000000110000000000000001",
       "00000000000000000000000000001111", "11101011101011101011101011101011",
      -- 2nd ARC_IN updated
       "00000000000000000000001000100000", "00000000000000000000000000000010",
       "00000000000000000000000000000100", "00000000000000100000000000000011",
       "00000000000000000000000000000000", "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"))
M:    EVENT /DES_SYS_TEST_BENCH/COPROC/COP/DATAIN (value =
      "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ")

10671 NS  -- DES Coprocessor ends RAM read
M12:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/RW (value = 'Z')

10686 NS
M11:  EVENT /DES_SYS_TEST_BENCH/COPROC/COP/IO (value = '1')

10691 NS  -- update ARCS_IN_STATUS in LP RAM partition
M1:   EVENT /DES_SYS_TEST_BENCH/COPROC/COP/DATAOUT (value =
      "00000000000000000000000000000010")


Posting of the message to the next event list involved three bus transfers to the CAM.
During the first cycle the TO_LP and FROM_LP fields of the event were sent to the CAM (see
CAM fields in Figure 4.4). The event Time_Tag and pointer to CPU memory followed in
subsequent doubleword bus transfer cycles.

Test data verifying the third CAM write bus cycle is shown below. The event memory
pointer from register four was written to the CAM and appended to the previous
data to complete the next event.


11001 NS -- Event memr_ptr assigned by CPU
M72: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/BUFF_IO (value =
"11101011101011101011101011101011")

11016 NS -- DES Coprocessor data_bus
M1: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/DATAOUT (value =
"11101011101011101011101011101011")

11031 NS -- CAM write signal
M8: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/WTC (value = '1')
-- NOT get_next_event
M9: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/NE (value = '0')
VMON15: READ /DES_SYS_TEST_BENCH/COPROC/CAM/CAM_PROC/EVENT(1) (value =
"10000001100001 -- TO/FROM_LP
00000000000000000000000000001111 -- Time_Tag
11101011101011101011101011101011") -- Memr_Ptr

The most significant bit of the posted event in the CAM was set to indicate a valid
event. The posted event represents a feedback message from the carwash exit (i.e., node 3,
LP 1 from Figure 3.4), scheduled for the simulation time of 15 units. The event memory
pointer indicates the event data structure is stored in CPU main memory at address
EBAEBAEBh.
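
As a quick cross-check of the values quoted above, the three fields of the posted CAM event can be decoded from the bit strings in the test data. This is a minimal Python sketch; the class and function names are illustrative, and the LP identities packed into the first field are deliberately left undecoded because their bit layout is not spelled out in this part of the text.

```python
from typing import NamedTuple


class CamEvent(NamedTuple):
    valid: bool      # most significant bit of the TO/FROM_LP field
    time_tag: int    # scheduled simulation time
    memr_ptr: int    # address of the event data structure in CPU memory


def decode_event(to_from_bits: str, time_tag_bits: str, memr_ptr_bits: str) -> CamEvent:
    """Convert the three logged bit strings of one CAM entry into numbers."""
    return CamEvent(
        valid=to_from_bits[0] == "1",
        time_tag=int(time_tag_bits, 2),
        memr_ptr=int(memr_ptr_bits, 2),
    )


# Bit strings copied from the EVENT(1) entry logged at 11031 ns.
event = decode_event(
    "10000001100001",
    "00000000000000000000000000001111",
    "11101011101011101011101011101011",
)
print(event.valid, event.time_tag, f"{event.memr_ptr:08X}h")
```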

After  posting  the  new  event  to  the  next  event  list,  the  CAM  returned  a  free-space 
status  to  the  DES  coprocessor.  The  DES  coprocessor  maintains  the  CAM  free-space  status 
in  the  FLAGS  register,  bit  (0),  which  is  read  to  determine  if  CAM  overflow  has  occurred. 

5.3.4 Get Next Event Function The Get_Event function was verified by monitoring
the DES coprocessor’s use of the requested LP’s arcs_in_status register and ensuring the
proper action (i.e., wait message or retrieval of next event) was taken. To verify the proper
next event was provided, the contents of the CAM were checked both before and after
event retrieval.

The time advance of the LP’s simulation clock was also verified during this portion
of system testing by ensuring the LP’s simulation clock was updated with the next event
time. Additionally, the sending of “null” messages to all output arcs, which is required
when the next scheduled event contains a “null” memory pointer, was verified.

The DES coprocessor’s execution of the Get_Event instruction requires additional
operands (i.e., arcs_in_status and simulation time) which were read from the LP’s RAM
partition. Below is test data that verifies the RAM partition was accessed to retrieve the
necessary LP data for Get_Event execution.

The LP requesting the next event was provided by the CPU along with the opcode
instruction and was maintained in general purpose register one. Using the LP number as an
index, the DES coprocessor accessed the proper RAM partition, via the partition pointer
table. The contents of four RAM addresses, starting with the partition base address, were
read and used to update registers three through six with the essential LP data (i.e., arcs-in
status, LP_Delay, number of input/output arcs, and Simulation_Time).

An excerpt of the system test results below shows the coprocessor reading the LP’s
simulation time from the RAM partition (i.e., memory address 547, decimal) and updating the
general purpose registers. The contents of register three indicated that all input arcs (i.e.,
two for LP 0 on node 0) have posted messages; therefore, the next event for LP 0 may be
retrieved from CAM memory.


-- LP 0 executing Get_Event

16978 NS -- RAM address of LP simulation time
M2: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/MA (value = "01000100011")
VMON3: READ /DES_SYS_TEST_BENCH/COPROC/MEM/P/M(547) (value =
"00000000000000000000000000000000")
M: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/DATAIN (value =
"00000000000000000000000000000000") -- LP simulation time

16988 NS
M83: EVENT /DES_SYS_TEST_BENCH/CPU/CLOCK (value = '0')

17013 NS
M72: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/BUFF_IO (value =
"00000000000000000000000000000000")

17018 NS
M68: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/REG_32 (value =
-- LP 0 -- RAM partition base_addr (544)
("00000000000000000000000000000000", "00000000000000000000001000100000",
-- ARCS_IN_STATUS register -- LP delay (4)
"00000000000000000000000000000011", "00000000000000000000000000000100",
-- # input | # output arcs -- LP simulation time
"00000000000000100000000000000011", "00000000000000000000000000000000",
"00000000000000000000000000000100", "00000000000000100000000000000011",
"00000000000000000000000000000000", "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"))

17028 NS
M2: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/MA (value = "ZZZZZZZZZZZ")
M11: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/IO (value = '0')
M: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/DATAIN (value =
"ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ")
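
The decision that Get_Event makes from these registers can be illustrated with a short Python sketch. It assumes, based on the register comments in the trace, that the arc-count word holds the number of input arcs in its upper 16 bits and the number of output arcs in its lower 16 bits; the function names are hypothetical.

```python
def split_arc_counts(word: int) -> tuple[int, int]:
    """Split the packed word into (input arcs, output arcs).

    Assumption: upper 16 bits hold the input count, lower 16 bits the output count.
    """
    return word >> 16, word & 0xFFFF


def all_input_arcs_posted(arcs_in_status: int, num_input_arcs: int) -> bool:
    """True when every input arc has its status bit set."""
    full_mask = (1 << num_input_arcs) - 1
    return (arcs_in_status & full_mask) == full_mask


# Register values copied from monitor M68 at 17018 ns.
status = int("00000000000000000000000000000011", 2)
counts = int("00000000000000100000000000000011", 2)

n_in, n_out = split_arc_counts(counts)
action = "retrieve next event" if all_input_arcs_posted(status, n_in) else "send wait message"
print(n_in, n_out, action)
```

With both status bits set for the two input arcs of LP 0, the sketch selects retrieval, which agrees with the behavior described for this test case.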


The DES coprocessor then retrieved the next event from the CAM, via a CAM read
cycle; however, the CAM was first given the identity of the requesting LP to perform a
search for the earliest scheduled event for that LP.

The system test data that verifies a next event read from CAM is shown below.
Initially, the CAM was given the identity of the LP requesting the next event. The ‘NE’
control line alerted the CAM that a next event request was initiated, rather than appending a new
event to the next event list.

The contents of the CAM were checked, via monitors 15 through 17, which indicated
three valid events, two of which were for LP 0. At time 17113 ns, the CAM selected the
next event for LP 0 by toggling the valid bit of the third event in the CAM. The first CAM
event (i.e., VMON15) is also scheduled for LP 0; however, the scheduled time of 15 units
occurs later than the selected next event which is scheduled for 10 units.


-- LP 0 executing Get_Event

17093 NS
M1: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/DATAOUT (value =
"00000000000000000000000000000000")

17108 NS -- LP 0 requesting next_event
M8: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/WTC (value = '1')
M9: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/NE (value = '1')
VMON15: READ /DES_SYS_TEST_BENCH/COPROC/CAM/CAM_PROC/EVENT(1)
(value = "10000001100001 -- TO/FROM_LP
00000000000000000000000000001111 -- Time_Tag
11101011101011101011101011101011") -- Memr_Ptr
VMON16: READ /DES_SYS_TEST_BENCH/COPROC/CAM/CAM_PROC/EVENT(2)
(value = "10000100000000 -- TO/FROM_LP
00000000000000000000000000010100 -- Time_Tag
11001100110011001100110011001100") -- Memr_Ptr
VMON17: READ /DES_SYS_TEST_BENCH/COPROC/CAM/CAM_PROC/EVENT(3)
(value = "10000000000000 -- TO/FROM_LP
00000000000000000000000000001010 -- Time_Tag
11110011110011110011110011110000") -- Memr_Ptr

17112 NS
M83: EVENT /DES_SYS_TEST_BENCH/CPU/CLOCK (value = '0')

17113 NS
VMON15: READ /DES_SYS_TEST_BENCH/COPROC/CAM/CAM_PROC/EVENT(1)
(value = "10000001100001
00000000000000000000000000001111
11101011101011101011101011101011")
VMON16: READ /DES_SYS_TEST_BENCH/COPROC/CAM/CAM_PROC/EVENT(2)
(value = "10000100000000
00000000000000000000000000010100
11001100110011001100110011001100")
VMON17: READ /DES_SYS_TEST_BENCH/COPROC/CAM/CAM_PROC/EVENT(3)
(value = "00000000000000 -- toggle "valid" bit
00000000000000000000000000001010
11110011110011110011110011110000")
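
The selection traced above can be mimicked in software as follows. This is a rough Python analogue only: a CAM performs the match associatively in hardware, whereas the sketch scans a list. The `to_lp` values are hypothetical decoded identities (0 for the two LP 0 events, 1 for the other event), while the time tags and memory pointers are taken from the trace.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CamEntry:
    valid: bool
    to_lp: int       # hypothetical decoded destination LP
    time_tag: int
    memr_ptr: int


def select_next_event(cam: list[CamEntry], requesting_lp: int) -> Optional[CamEntry]:
    """Pick the earliest valid event for the requesting LP and clear its valid bit."""
    candidates = [e for e in cam if e.valid and e.to_lp == requesting_lp]
    if not candidates:
        return None
    earliest = min(candidates, key=lambda e: e.time_tag)
    earliest.valid = False  # corresponds to toggling the "valid" bit in the trace
    return earliest


cam = [
    CamEntry(True, 0, 15, 0xEBAEBAEB),  # EVENT(1)
    CamEntry(True, 1, 20, 0xCCCCCCCC),  # EVENT(2), destined for a different LP
    CamEntry(True, 0, 10, 0xF3CF3CF0),  # EVENT(3)
]

chosen = select_next_event(cam, requesting_lp=0)
print(chosen.time_tag, f"{chosen.memr_ptr:08X}h", [e.valid for e in cam])
```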


The CAM event, which consists of 77 bits, was then parsed into three segments and
sent to the DES coprocessor for submission to the CPU. The most significant bit of the
first segment (i.e., TO/FROM_LP identities) is used by the CAM to indicate that additional
events, received from the same LP, still remain in the CAM. Since no additional events,
from the same LP, remain in the CAM, this bit is not set, which alerts the DES coprocessor
to update the arcs-in status register of the requesting LP.

17168 NS -- CAM read signal

M7: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/RDC (value = '1')
M9: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/NE (value = 'Z')
M8: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/WTC (value = '0')
-- event TO/FROM_LP (source generates car)
M: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/DATAIN (value =
"00000000000000000000000000000000")

17174 NS
M83: EVENT /DES_SYS_TEST_BENCH/CPU/CLOCK (value = '1')

17223 NS
M68: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/REG_32 (value =
("00000000000000000000000000000000", "00000000000000000000001000100000",
"00000000000000000000000000000011", "00000000000000000000000000000100",
"00000000000000100000000000000011", "00000000000000000000000000000000",
-- event to/from_lp
"00000000000000000000000000000000", "00000000000000100000000000000011",
"00000000000000000000000000000000", "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"))

17233 NS
M7: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/RDC (value = '0')
M: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/DATAIN (value =
"ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ")

17236 NS
M83: EVENT /DES_SYS_TEST_BENCH/CPU/CLOCK (value = '0')

17248 NS
M7: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/RDC (value = '1')
M: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/DATAIN (value =
"00000000000000000000000000001010")

17298 NS
M72: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/BUFF_IO (value =
"00000000000000000000000000001010") -- event Time_Tag
M83: EVENT /DES_SYS_TEST_BENCH/CPU/CLOCK (value = '1')

17303 NS
M68: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/REG_32 (value =
("00000000000000000000000000000000", "00000000000000000000001000100000",
"00000000000000000000000000000011", "00000000000000000000000000000100",
"00000000000000100000000000000011", "00000000000000000000000000000000",
-- event to/from_lp -- event time_tag
"00000000000000000000000000000000", "00000000000000000000000000001010",
"00000000000000000000000000000000", "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"))

17313 NS
M7: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/RDC (value = '0')
M: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/DATAIN (value =
"ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ")

17328 NS
M7: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/RDC (value = '1')
M: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/DATAIN (value =
"11110011110011110011110011110000")

17360 NS
M83: EVENT /DES_SYS_TEST_BENCH/CPU/CLOCK (value = '0')

17378 NS -- Event memory pointer assigned by CPU
M72: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/BUFF_IO (value =
"11110011110011110011110011110000")

17383 NS
M68: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/REG_32 (value =
("00000000000000000000000000000000", "00000000000000000000001000100000",
"00000000000000000000000000000011", "00000000000000000000000000000100",
"00000000000000100000000000000011", "00000000000000000000000000000000",
-- event to/from_lp -- event time_tag
"00000000000000000000000000000000", "00000000000000000000000000001010",
-- event memr_ptr
"11110011110011110011110011110000", "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"))


Updating  of  the  arcs-in  status  register  was  accomplished  by  a  sequential  read  of  the 
input  arc  identities  from  RAM  and  searching  for  the  LP  that  provided  the  next  event.  At 
17393  ns,  general  register  10  has  been  loaded  with  the  next  event’s  input  LP  identity.  This 
was  extracted  from  the  first  event  segment  sent  by  the  CAM,  and  was  used  as  a  reference 
for  the  LP  identities  read  from  RAM. 


17393 NS
M7: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/RDC (value = '0')
M: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/DATAIN (value =
"ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ")
M68: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/REG_32 (value =
("00000000000000000000000000000000", "00000000000000000000001000100000",
"00000000000000000000000000000011", "00000000000000000000000000000100",
"00000000000000100000000000000011", "00000000000000000000000000000000",
-- event to/from_lp -- event time_tag
"00000000000000000000000000000000", "00000000000000000000000000001010",
-- event memr_ptr -- next event input LP
"11110011110011110011110011110000", "00000000000000000000000000000000"))

17422 NS
M83: EVENT /DES_SYS_TEST_BENCH/CPU/CLOCK (value = '1')


5.3.5 Post Event Function The Post_Event function was verified by checking that
the DES coprocessor received the event to post from the CPU and combined the simulation
time with the inherent LP delay to schedule the event’s occurrence. To verify posting of
the event, the CPU/DES coprocessor interface was monitored to ensure that all output
arcs received a message. The designated TO_LP was verified to receive the event message
while all other LPs received a “null” message (i.e., memory pointer to event was 0).

Similar to all other DES functions, the opcode instruction and primary operands
were received from the CPU. The primary operands for the Post_Event instruction were
the FROM_LP (i.e., the LP to post the event), a pointer to the event data structure in CPU
memory, and the TO_LP (i.e., designated LP to receive the event). After receipt, these
operands were stored in the first three general purpose registers by the DES coprocessor.

The additional operands for the Post_Event function (i.e., LP_DELAY, number of
input/output arcs, and the LP’s simulation time) were read from RAM, which was referenced
by the partition pointer corresponding to the LP posting the event, and loaded in
registers five through seven.

Test data verifying the DES coprocessor’s implementation of the required register
loading for the Post_Event function are shown below. Monitor M68, at 23890 ns, reflects
the DES coprocessor’s registers loaded with the CPU provided operands and the additional
operands fetched from LP 0’s RAM partition. At time 23895 ns, the sum of registers five
and seven (i.e., LP_DELAY and simulation time) was stored in register eight. This sum
is the new simulation time which was updated to account for the occurrence of the next
event and will be used to time tag the output messages.


23855 NS -- RAM address with LP simulation time
M2: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/MA (value = "01000100011")
VMON3: READ /DES_SYS_TEST_BENCH/COPROC/MEM/P/M(547) (value =
"00000000000000000000000000001010")
M: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/DATAIN (value =
"00000000000000000000000000001010") -- LP simulation time

23870 NS
M83: EVENT /DES_SYS_TEST_BENCH/CPU/CLOCK (value = '1')

23890 NS
M72: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/BUFF_IO (value =
"00000000000000000000000000001010")
M68: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/REG_32 (value =
-- LP to post event -- memr_ptr to event
("00000000000000000000000000000000", "01110111011101110111011101110111",
"00000000000000010000000000000001", "00000000000000000000001000100000",
-- LP delay -- # of input | # output arcs
"00000000000000000000000000000100", "00000000000000100000000000000011",
-- LP simulation time
"00000000000000000000000000001010", "00000000000000000000000000001010",
"11110011110011110011110011110000", "00000000000000000000000000000000"))

23895 NS
M68: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/REG_32 (value =
("00000000000000000000000000000000", "01110111011101110111011101110111",
"00000000000000010000000000000001", "00000000000000000000001000100000",
"00000000000000000000000000000100", "00000000000000100000000000000011",
-- post event time_tag
"00000000000000000000000000001010", "00000000000000000000000000001110",
"11110011110011110011110011110000", "00000000000000000000000000000000"))

Sending  messages  was  performed  by  the  DES  coprocessor  reading  each  of  the  LP’s 
output  arc  identities  from  RAM  and  building  the  output  message  (i.e.,  “null”  or  event) 
by  comparing  with  the  designated  receiving  LP’s  identity  held  in  register  three.  The  test 
data  below  shows  the  output  arc,  read  from  RAM,  that  matches  the  designated  receiving 
LP  identity  in  register  three.  The  LP  identity  read  from  RAM  is  loaded  into  register  nine. 
If  it  does  not  match  the  identity  of  the  LP  to  receive  the  event,  a  null  message  is  sent  to 
this  output  arc. 
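
The message-building rule in this paragraph, together with the time-tag sum described earlier, can be sketched in Python. The four-field layout (TO_LP, FROM_LP, time tag, memory pointer) is an assumption modeled on the message fields described for Post_Msg, the function name is illustrative, and the first two output arc identities are made-up placeholders; the delay, simulation time, event pointer, and designated TO_LP are the values shown in the test data.

```python
def build_output_messages(output_arcs, from_lp, to_lp, sim_time, lp_delay, memr_ptr):
    """Build one message per output arc; only the designated TO_LP gets the event pointer."""
    time_tag = sim_time + lp_delay  # register five + register seven in the trace
    messages = []
    for arc in output_arcs:
        pointer = memr_ptr if arc == to_lp else 0  # a zero pointer marks a "null" message
        messages.append((arc, from_lp, time_tag, pointer))
    return messages


# The first two arc identities are hypothetical; the third matches the TO_LP in the trace.
output_arcs = [0x00000001, 0x00020001, 0x00010001]

messages = build_output_messages(
    output_arcs,
    from_lp=0x00000000,
    to_lp=0x00010001,
    sim_time=10,
    lp_delay=4,
    memr_ptr=0x77777777,
)
for to_field, from_field, time_tag, pointer in messages:
    print(f"{to_field:08X} {from_field:08X} {time_tag} {pointer:08X}")
```

Every output arc receives a message carrying the same time tag, which is what lets the receiving LPs advance under the conservative protocol even when they get only a null message.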


24105 NS -- RAM address of output_arc
M2: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/MA (value = "01000101000")
-- output LP (node 1, LP 1)
VMON8: READ /DES_SYS_TEST_BENCH/COPROC/MEM/P/M(552) (value =
"00000000000000010000000000000001")
M: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/DATAIN (value =
"00000000000000010000000000000001")

24118 NS
M83: EVENT /DES_SYS_TEST_BENCH/CPU/CLOCK (value = '1')

24140 NS
M72: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/BUFF_IO (value =
"00000000000000010000000000000001")

24145 NS
M68: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/REG_32 (value =
("00000000000000000000000000000000", "01110111011101110111011101110111",
-- TO_LP for message
"00000000000000010000000000000001", "00000000000000000000001000100000",
"00000000000000000000000000000100", "00000000000000100000000000000011",
"00000000000000000000000000001010", "00000000000000000000000000001110",
"00000000000000010000000000000001", "00000000000000000000000000000000"))

Both event and “null” messages are composed of four fields. The CPU/DES coprocessor
interface activity is similar for both “null” and real messages; the only difference
is the message’s memory pointer, which is set to zero for “null” messages. System
test data verifying the required interface to send an event message is shown below. The
test data verifies a CPU read of the first field (i.e., TO_LP) of the output message. The
CPU read cycle was initiated by an interrupt request (i.e., INTR = ‘1’) from the DES
coprocessor and also terminated by the DES coprocessor by asserting the ready line low at
24313 ns. The remaining message fields in the DES coprocessor registers were subsequently
transferred to the CPU, where the message is packaged and routed over the hypercube’s
interconnection circuitry.


24215 NS -- CPU interrupt request
M3: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/INTR (value = '1')
M89: EVENT /DES_SYS_TEST_BENCH/CPU/INTIN (value = '1')

24238 NS -- CPU acknowledges with control lines
M85: EVENT /DES_SYS_TEST_BENCH/CPU/WROUT (value = '0')
M87: EVENT /DES_SYS_TEST_BENCH/CPU/A15OUT (value = '1')
M86: EVENT /DES_SYS_TEST_BENCH/CPU/M_IOOUT (value = '0')
M14: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/WR (value = '0')
M16: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/NPS2 (value = '1')
M15: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/NPS1 (value = '0')

24242 NS
M83: EVENT /DES_SYS_TEST_BENCH/CPU/CLOCK (value = '1')

24253 NS -- DES begins write
M5: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/WTP (value = '1')
M3: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/INTR (value = '0')
-- TO_LP for message
M1: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/DATAOUT (value =
"00000000000000010000000000000001")
M89: EVENT /DES_SYS_TEST_BENCH/CPU/INTIN (value = '0')

24283 NS
M95: EVENT /DES_SYS_TEST_BENCH/CPU/DATAIN (value =
"00000000000000010000000000000001")

24304 NS
M83: EVENT /DES_SYS_TEST_BENCH/CPU/CLOCK (value = '0')

24313 NS
M5: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/WTP (value = '0')
-- DES Coprocessor ends bus cycle
M20: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/READYO (value = '0')
M90: EVENT /DES_SYS_TEST_BENCH/CPU/RDYIN (value = '0')
M72: EVENT /DES_SYS_TEST_BENCH/COPROC/COP/BUFF_IO (value =
"00000000000000000000000000000000")


The DES coprocessor relies on the CPU for access to the interconnection circuitry
of the iPSC/2 hypercube, hence an interrupt request (i.e., INTR = ‘1’) was used to route
the message fields to the CPU for message output. The CPU/DES coprocessor interface
signals, shown in Figure 5.5, provided the control needed to transfer message fields between
the DES coprocessor and the CPU. The individual bus transfer cycles were asynchronously
regulated via the interrupt request and ready lines from the DES coprocessor.

Figure 5.5. CPU/DES Interface Signals for Post_Event Output


5.4  DES  Coprocessor  System  Performance 

Implementation of the general DES algorithm was done through the VHDL behavioral
description of the DES coprocessor system. In addition to verifying the algorithm’s
operation, an overall timing analysis was made, using the Zycad VHDL system and the
associated general purpose post processor.

Timing data was collected during testbench simulation of the DES coprocessor system.
Using this data, the average execution times of the primary DES functions and the
CPU/DES coprocessor bus transfer cycle times were obtained.

Table 5.1 is a summary of DES coprocessor system execution times. Timing data
used to compile this table was gathered from testbench simulations of the DES coprocessor
system supporting two logical processes per computing node.

The significant variance in execution times of several DES functions is evident, but
not unexpected. The average execution times for the DES functions will vary, depending
on the LP configuration of the simulation.

Table 5.1. Function Execution Times for DES Coprocessor System

DES             Time (ns)
Function        min     max     mean    σ
Sim_Init        1118    2468    1793    675.0
Post_Msg         962    1060     996     38.0
Get_Event       1355    1355    1355      0.0
Post_Event      1113    2313    1713    600.0
CPU_IO           201     325     212     29.8
Intr_CPU          93     105     103      3.4
Send_CPU          65      65      65      0.0

The large variance in execution times for the Sim_Init and Post_Event functions is
attributed to the difference in number of output arcs associated with the two executing LPs.
The conservative synchronization protocol requires a message transmission, to each output
arc of the LP, every time these functions are executed. Since the two logical processes
analyzed (i.e., node 0 of Figure 3.4) have one and three output arcs respectively, a
considerable difference in execution times was expected for functions requiring comprehensive
communications to output arcs.
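
The Sim_Init and Post_Event rows of Table 5.1 can be checked against this explanation with a few lines of Python. The sketch assumes that the tabulated minimum and maximum are the only two observations for each function (one per LP), which the discussion suggests but does not state explicitly, and that the deviation column is a population standard deviation.

```python
from statistics import mean, pstdev

# (min, max) pairs from Table 5.1, treated here as the two per-LP measurements.
samples = {
    "Sim_Init": (1118, 2468),
    "Post_Event": (1113, 2313),
}

summary = {name: (mean(values), pstdev(values)) for name, values in samples.items()}
for name, (average, deviation) in summary.items():
    print(f"{name}: mean = {average} ns, sigma = {deviation} ns")
```

Under those assumptions the computed values agree with the tabulated averages and deviations for both functions, which is consistent with each figure coming from one execution per LP.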

The variance in execution time for the Post_Msg function is also a function of the
simulation’s LP configuration. Unlike the Sim_Init and Post_Event functions, the variance
in Post_Msg execution times is related to the number of input arcs passing messages to
the LP. The sequential reading of RAM, to update the arcs-in status during execution of
the Post_Msg function, has a variable execution time. Given n input arcs to an LP, the
Post_Msg function will require O(n/2) time, on average, to match the input LP, using a
sequential read of RAM. The LPs simulated have one and two inputs respectively; hence,
the variance in Post_Msg execution time is attributed to the difference in average RAM
read times required to update the LP’s input arcs status.
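
The average cost of the sequential match can be made concrete with a small Python sketch. It assumes each of the n input arcs is equally likely to be the sender, an assumption the text does not state; under it the expected number of RAM reads is (n + 1)/2, which grows as the n/2 quoted above.

```python
from fractions import Fraction


def average_reads(num_input_arcs: int) -> Fraction:
    """Expected sequential reads to find the sender, assuming a uniformly likely position."""
    if num_input_arcs < 1:
        raise ValueError("an LP needs at least one input arc")
    total_reads = sum(range(1, num_input_arcs + 1))  # 1 read for arc 1, 2 for arc 2, ...
    return Fraction(total_reads, num_input_arcs)


for n in (1, 2, 8):
    print(n, average_reads(n))
```

For the two LPs simulated (one and two input arcs) the expected read counts differ by half a read, which is in line with the small spread reported for Post_Msg.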

The variance in CPU bus cycle transfers, reflected in the CPU_IO function, is a
function of RAM access by the DES coprocessor and a limitation of the testbench design.
The DES coprocessor requires one-half clock cycle, at test startup, prior to a transition
to the CPU_IO state. The significance of this idle time diminishes as the number of CPU
writes increases.

The RAM access time affects average CPU_IO execution time when operands for
the Sim_Init function are written to the DES coprocessor. The limited number of general
purpose registers available to the DES coprocessor requires RAM storage of a portion of the
Sim_Init operands. The inclusion of RAM write cycles, particularly the initial write, which
involves a read of the partition pointer table, has a significant impact on average CPU_IO
time. The effect of RAM write cycles is also diminished as the simulation execution time
is increased, since the only function requiring RAM access is Sim_Init, which is executed
only once per simulation for each LP.

Average bus transfer cycle times, between the CPU and DES coprocessor, comply
with the two clock cycle standard of the Intel 80386 (19:5-353). However, strict compliance
with this standard is not required with an asynchronous interface, as the CPU inserts wait
states, when needed, to complete longer bus cycles.

VI. Results and Recommendations


6.1 Introduction

A general distributed Discrete Event Simulation (DES) was analyzed to determine
the system requirements for a hardware accelerator. A two-phased approach was used
in this effort. Initially, a generic DES, represented by a carwash model, was analyzed
to determine algorithm bottlenecks that exhibited a potential for acceleration through a
hardware implementation.

A behavioral description of a hardware coprocessor, implementing the general DES
functions, was then specified using the IEEE standard VHSIC Hardware Description
Language (VHDL). The coprocessor behavioral description was then simulated using a
testbench representation of the carwash model.

The  results  of  this  effort  are  summarized  in  this  chapter.  Additionally,  specific  topics 
and  related  issues  that  merit  further  consideration  are  presented. 

6.2  Summary  of  Findings 

Test  data  collected  from  simulations  of  the  carwash  model,  executed  on  the  eight  node 
Intel  iPSC/2  hypercube,  led  to  results  that  are  not  surprising.  The  most  time  consuming 
portions  of  this  general  DES  algorithm  are  those  requiring  communications  support  in  the 
distributed  hypercube  architecture. 

The conservative time synchronization protocol, used in distributed computing
architectures, requires a significant amount of message passing, with its associated
communications overhead, to ensure simulation progress and deadlock avoidance. The basic
functions of a distributed Discrete Event Simulation (DES) algorithm were implemented
with the SPECTRUM simulation testbed. These basic functions (i.e., initialization,
post-message, get-event, advance-time, and post-event) were found to require varying degrees
of communications support.

The communication requirements were of two types: sending and receiving messages.
In both cases, the required communications resulted in significant time spent waiting to
complete the communications operation. During this waiting time, the processor remained
idle and minimal progress toward simulation completion was accomplished.

Acceleration  of  general  DES  simulations  is  possible  by  eliminating  this  processor  idle 
time.  One  approach  to  freeing  the  processor  from  idle  wait  time  is  to  implement  the  basic 
discrete  event  simulation  functions  in  a  separate  coprocessor.  Ideally,  this  will  allow  the 
processor  to  focus  on  computational  activity,  with  little  idle  time,  while  the  coprocessor 
executes  the  basic  simulation  functions. 

Assuming  such  a  coprocessor  is  possible  and  a  sufficient  workload  is  available  to  keep 
the  processor  active,  the  potential  for  speedup  was  calculated.  A  factor  of  four  speedup 
is  possible  when  a  single  logical  process  is  executed  on  each  processing  node,  while  the 
speed-up  potential  for  two  logical  processes  per  node  approaches  twenty.  Both  estimates 
assume  ideal  coprocessor  support  in  that  the  processor  is  relieved  of  all  communications 
idle  time  associated  with  the  discrete  event  simulation  execution. 
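
The relationship between eliminated idle time and speedup can be illustrated with a simple model in Python. This is not the calculation used to produce the figures above: it assumes the coprocessor removes idle time entirely and adds no overhead of its own, and the idle fractions shown (0.75 and 0.95) are back-solved from the quoted speedups purely for illustration; they are not measurements reported here.

```python
def ideal_speedup(idle_fraction: float) -> float:
    """Speedup when the idle share of run time is removed and nothing else changes."""
    if not 0.0 <= idle_fraction < 1.0:
        raise ValueError("idle_fraction must be in [0, 1)")
    return 1.0 / (1.0 - idle_fraction)


# Illustrative idle fractions only (assumed, not measured).
for label, idle_fraction in (("one LP per node", 0.75), ("two LPs per node", 0.95)):
    print(f"{label}: idle {idle_fraction:.0%} -> speedup {ideal_speedup(idle_fraction):.1f}")
```

Read this way, a factor of four corresponds to a processor that is idle three quarters of the time, and a factor near twenty to one that is idle about 95 percent of the time.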

The  behavioral  description  of  a  DES  coprocessor  was  presented.  A  state  machine 
representation,  using  a  conventional  von  Neumann  architecture,  was  used  to  define  the 
system  requirements  for  the  design  of  this  discrete  event  simulation  coprocessor. 

The  coprocessor  architecture  employs  a  set  of  general  purpose  working  registers, 
a  local  random  access  memory  for  operand  and  code  storage,  and  a  content  addressable 
memory  for  next  event  list  management.  The  DES  coprocessor  system  was  designed  with  a 
32-bit  architecture  to  provide  direct  compatibility  with  proven  microprocessor  technology. 

Simulations of the DES coprocessor behavioral description verified the implementation
of the basic DES algorithm functions and confirmed the compatibility with the Intel
80386 32-bit microprocessor. Low-level interfacing with the communications circuitry of
the Intel iPSC/2 hypercube was not included in this behavioral description as proprietary
restrictions limited access to the operation of this hardware.

The importance of simulating physical systems continues to increase in many fields.
The size and complexity of the simulation models employed are also increasing, making
the need for improved techniques for executing these simulations paramount. The use
of a discrete event simulation coprocessor has potential for overcoming the problem of
communications overhead and accelerating the execution of discrete event simulations in
a distributed hypercube architecture.

6.3  Recommendations

Several issues concerning the design and implementation of a discrete event simulation
accelerator were discovered during the course of this effort. Consideration should be
given to the further investigation of these issues to determine their merit and potential for
enhancing the current coprocessor design.

6.3.1  CAM Storage  As designed, the coprocessor relies on the superior memory
management capability of the Intel 80386 processor for actual event storage. Hence, the
coprocessor is forced to maintain a 32-bit pointer to the address in physical memory where
the event and associated data structures are maintained.

The maintenance of this 32-bit pointer in CAM memory is unnecessary as it provides
no information the CAM can utilize in searching for the next scheduled event. The
potential to double the CAM storage capacity exists, if this physical memory pointer can
be maintained outside the CAM and still be associated with the next event search fields.
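
One way to picture this recommendation is a store in which the CAM holds only the searchable fields and a conventional memory, indexed by the same slot number, holds the pointer. The sketch below is a software model under that assumption; the class name `SplitEventStore` and its methods are hypothetical, and the `min` call merely stands in for the parallel search a real CAM would perform.

```python
class SplitEventStore:
    """Model of a CAM that stores search fields only.

    keys[i] models one CAM word: (time_tag, lp_ids).
    pointers[i] models ordinary RAM holding the 32-bit physical address
    of the event data for the same slot i.
    """

    def __init__(self, capacity):
        self.keys = [None] * capacity
        self.pointers = [None] * capacity

    def post_event(self, time_tag, lp_ids, mem_ptr):
        slot = self.keys.index(None)  # first free slot; ValueError if full
        self.keys[slot] = (time_tag, lp_ids)
        self.pointers[slot] = mem_ptr
        return slot

    def get_next_event(self):
        occupied = [(key, slot) for slot, key in enumerate(self.keys)
                    if key is not None]
        if not occupied:
            return None
        # Software stand-in for the CAM's search for the earliest time tag.
        (time_tag, lp_ids), slot = min(occupied)
        mem_ptr = self.pointers[slot]
        self.keys[slot] = None
        self.pointers[slot] = None
        return time_tag, lp_ids, mem_ptr


store = SplitEventStore(capacity=4)
store.post_event(30, 0x0102, 0x1000)
store.post_event(10, 0x0201, 0x2000)
first = store.get_next_event()
```

Because the slot index alone ties a key to its pointer, no pointer bits need to occupy CAM words in this model.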

6.3.2  CAM  Overflow  The  issue  of  potential  CAM  overflow  and  a  mechanism  to 
handle  such  an  event  was  not  thoroughly  addressed  by  this  thesis.  Eventually,  the  limited 
CAM  capacity  will  be  reached  and  exceeded  as  simulations  get  larger  and  more  events  are 
generated. 

The use of multiple CAMs in tandem or a hierarchical structure of multiple CAMs
could be employed to delay this eventual overflow. However, the time required to swap
in previously overflowed events from main memory, and the best swap-in paradigm (i.e.,
number of events/block size), require further consideration.
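
The swap-in question can be explored with a small model of a bounded CAM backed by an overflow queue in main memory. This is a sketch under stated assumptions, not the coprocessor's mechanism: the class `OverflowingEventList`, its `block_size` parameter, and the `idle_cycle` method are hypothetical names, and a Python `deque` stands in for the circular queue in RAM.

```python
from collections import deque


class OverflowingEventList:
    """Bounded CAM with a FIFO overflow queue held in main memory."""

    def __init__(self, cam_capacity, block_size=1):
        self.cam_capacity = cam_capacity
        self.block_size = block_size  # events moved per idle cycle
        self.cam = []                 # models CAM contents: (time_tag, event)
        self.overflow = deque()       # models the queue in main memory

    def post_event(self, time_tag, event):
        if len(self.cam) < self.cam_capacity:
            self.cam.append((time_tag, event))
        else:
            self.overflow.append((time_tag, event))

    def idle_cycle(self):
        """Move up to block_size overflowed events into free CAM slots."""
        moved = 0
        while (moved < self.block_size and self.overflow
               and len(self.cam) < self.cam_capacity):
            self.cam.append(self.overflow.popleft())
            moved += 1
        return moved

    def get_next_event(self):
        if not self.cam:
            return None
        entry = min(self.cam, key=lambda e: e[0])
        self.cam.remove(entry)
        return entry


events = OverflowingEventList(cam_capacity=2, block_size=2)
for tag, name in [(5, "a"), (7, "b"), (6, "c")]:
    events.post_event(tag, name)
```

Note that this model returns the earliest event held in the CAM only. An overflowed event with an earlier time tag would be served late until it is swapped in, which is one reason the swap-in policy and block size need careful study.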

6.3.3  Input Message Status  Knowledge of which input arcs have satisfied the
message-received requirement is paramount to implementing the Chandy-Misra conservative
synchronization protocol. The design requirements specify a sequential search algorithm,
for the receiving logical process, to update an input arc status register.

For a small number of input arcs (e.g., 10 was assumed for the design) this technique
is sufficient. However, as the number of arcs is increased, the time spent updating this
necessary status register may be unacceptable. The possibility of using a second CAM, or
a portion of the next event CAM, to update the input arc status has potential for further
acceleration as the number of input channels is increased.
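
The difference between the two update strategies can be sketched in software. The sequential version scans the arc table entry by entry, as the current design requires; the associative version uses a dictionary as a stand-in for a CAM lookup. All function names here are hypothetical, and the comparison count returned by the sequential version is only an illustration of how the cost grows with the number of arcs.

```python
def update_status_sequential(arc_ids, status, sender_id):
    """Scan the arc table in order and mark the matching arc.

    Returns the number of comparisons made, which grows with the
    position of the arc in the table.
    """
    for index, arc_id in enumerate(arc_ids):
        if arc_id == sender_id:
            status[index] = True
            return index + 1
    raise KeyError(sender_id)


def build_arc_index(arc_ids):
    """Associative stand-in for a CAM: arc id -> table position."""
    return {arc_id: index for index, arc_id in enumerate(arc_ids)}


def update_status_associative(arc_index, status, sender_id):
    """Mark the matching arc with a single associative lookup."""
    status[arc_index[sender_id]] = True


def all_arcs_ready(status):
    """True once every input arc has satisfied the message-received requirement."""
    return all(status)


arcs = [11, 12, 13]
status = [False] * len(arcs)
comparisons = update_status_sequential(arcs, status, 13)
```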

6.3.4  Interface to CPU Communications Hardware  Incorporation of the direct connect
module (DCM) in the iPSC/2 represents a major improvement in the hypercube's message
passing efficiency. However, the DES coprocessor design must rely on the CPU to access
the DCM to send and receive messages. In both cases this involves either
interrupting the current process or delaying the next process' activation.

Access to DCM hardware documentation was not available, yet the potential for
even greater simulation acceleration is present by allowing the DES coprocessor to access
the DCM directly. A direct coprocessor-to-DCM interface would eliminate CPU delays in
routing the event and null messages of the simulation and deserves further consideration.




Appendix A.  DES System Packages


The appendices that follow provide the source code listings of the VHDL files that
comprise the DES coprocessor system design and testbench. This first appendix contains
the system packages which define the types, subtypes, constants, and functions required
by the remaining files to implement the DES coprocessor design.

A.1  Bus Resolution Package

This appendix presents the bus resolution function used in the DES coprocessor
design. The source code listing is taken from the Zycad VHDL Reference Manual
(32:10-17,18) and tailored for this application. The Bus_sys package provides the unique
bus type (i.e., SYS_BUS) used to define the DES coprocessor and the necessary resolution
function, in terms of the Zycad-defined multi-valued logic, to avoid bus conflicts.
Additionally, type conversion functions for MVL7_Vector-to-Sys_bus and
Sys_bus-to-MVL7_Vector are provided.

-- FILE: bus_sys.vhd

-- AUTHOR: Paul J. Taylor
-- PURPOSE: Package for declaring types, subtypes, and
--          functions necessary to provide bus resolution
--          for the DES coprocessor system.
-- REFERENCE: Zycad Reference Manual pp. 10-17,18; 10-78..84

library ZYCAD;
library DESIGN;
use ZYCAD.ATTRIBUTES.all;
use ZYCAD.TYPES.all;
use ZYCAD.BV_ARITHMETIC.all;
use WORK.all;

package Bus_sys is

-- BUS RESOLUTION TYPES & FUNCTIONS
-- type SYS_BUS is array (INTEGER range <>) of MVL7;
-- function BUS_FUNC(INPUT : SYS_BUS) return MVL7;
-- subtype BUS_BIT is BUS_FUNC MVL7;
-- type BUS_TYPE is array (INTEGER range <>) of BUS_BIT;

-- BUS RESOLUTION with ZYCAD
-- truth table for "BUS_FUNC" function
constant tbl_BUS_FUNC : MVL7_TABLE :=

function BUS_FUNC(INPUT : MVL7_VECTOR) return MVL7;
attribute REFLEXIVE of BUS_FUNC: function is TRUE;
attribute RESULT_INITIAL_VALUE of BUS_FUNC: function is MVL7'POS('Z');
attribute TABLE_NAME of BUS_FUNC: function is "BUS_SYS.tbl_BUS_FUNC";

subtype BUS_BIT is BUS_FUNC MVL7;
type SYS_BUS is array (NATURAL range <>) of BUS_BIT;

-- TYPE CONVERSION FUNCTIONS
function CHANGE(INPUT : MVL7_VECTOR) return SYS_BUS;      -- ZYCAD drives overloaded
function CHANGE(INPUT : SYS_BUS) return MVL7_VECTOR;      -- ZYCAD drives overloaded
attribute CLOSELY_RELATED_TCF of CHANGE: function is TRUE;  -- ZYCAD drive attribute

end Bus_sys;

package body Bus_sys is


-- BUS_FUNC
-- Purpose: Resolution function for MVL7 signals. Used
--          with the DATA_BUS for multiple driver processes.
--          Zycad version (WiredX) from VHDL reference manual
--          (10-84)

function BUS_FUNC(INPUT : MVL7_VECTOR) return MVL7 is
  variable RESOLVED : MVL7 := 'Z';
begin
  for i in INPUT'range loop
    RESOLVED := tbl_BUS_FUNC(RESOLVED, INPUT(i));
  end loop;
  return RESOLVED;
end BUS_FUNC;

-- conversion functions for driving various types

function CHANGE (INPUT : SYS_BUS) return MVL7_VECTOR is
begin
  return MVL7_VECTOR(INPUT);
end CHANGE;

function CHANGE (INPUT : MVL7_VECTOR) return SYS_BUS is
begin
  return SYS_BUS(INPUT);
end CHANGE;

end Bus_sys;

A.2  DES Coprocessor System Package

This appendix contains the system package that holds the necessary type and constant
declarations for the DES coprocessor design. The generic time delays used in the
DES VHDL behavior are defined in this package. Functions that provide the type conversion
operations and bit vector manipulation required by the DES coprocessor behavior are
also included in this package.

-- FILE: system.vhd

-- AUTHOR: Paul J. Taylor
-- PURPOSE: Package for encapsulating types and constants
--          for the DES Coprocessor Design.

library ZYCAD;
library DESIGN;
use ZYCAD.ATTRIBUTES.all;
use ZYCAD.TYPES.all;
use ZYCAD.BV_ARITHMETIC.all;
use WORK.all;
use WORK.BUS_SYS.all;

package System is

-- CONSTANT TIME DELAYS
constant ODEL   : TIME := 15 ns;    -- DES output delay
constant RDEL   : TIME := 60 ns;    -- RAM/DES read delay
constant WDEL   : TIME := 60 ns;    -- DES write delay
constant ALUDEL : TIME := 60 ns;    -- ALU function delay (PER/2)
constant MADEL  : TIME := 30 ns;    -- DES memr access delay
constant PER    : TIME := 125 ns;   -- clock period (16 MHz)
constant DISDEL : TIME := 20 ns;    -- RAM disable delay
constant GDEL   : TIME := 5 ns;     -- Parallel I/O gate delay
constant FFDEL  : TIME := 5 ns;     -- Parallel I/O F-F delay
constant BUFDEL : TIME := 10 ns;    -- Parallel I/O buffer delay

-- MISC CONSTANTS
constant IRsize : POSITIVE := 32;   -- Bits in Instr Reg
constant MAsize : POSITIVE := 11;   -- Bits in RAM address
constant Ndata  : POSITIVE := 32;   -- RAM Data bus size
constant Naddr  : POSITIVE := 11;   -- RAM Addr bus size
constant N      : POSITIVE := 32;   -- Parallel I/O port size

-- SUBTYPES FOR BIT_VECTORS
subtype DWORD is SYS_BUS(31 downto 0);       -- resolved data_bus
subtype ADDR  is MVL7_VECTOR(31 downto 0);
subtype M_ADD is MVL7_VECTOR(10 downto 0);   -- 4K RAM (4 byte/Addr)

-- GENERAL PURPOSE REGISTERS
type REG32 is array (NATURAL range <>) of DWORD;   -- GP (32 bit) registers

-- MISC FUNCTIONS
function DWORD_to_MADD(DBL : DWORD) return M_ADD;
function JOIN_DWORDS(HI, LO : DWORD) return DWORD;
function MAP_FIELDS(TO_LP, FROM_LP : DWORD) return DWORD;
function HI_LO_ADD(DBL_WORD : DWORD) return DWORD;

end System;

package body System is


-- DWORD_to_MADD
-- Purpose: Converts (32) bit DWORD to (MAsize) bit
--          RAM memory address (32 bit -> 11 bit)

function DWORD_to_MADD(DBL : DWORD) return M_ADD is
  variable ADDRESS : M_ADD;
begin
  for I in MAsize-1 downto 0 loop
    if DBL(I) = '0' then
      ADDRESS(I) := '0';
    elsif DBL(I) = '1' then
      ADDRESS(I) := '1';
    else
      ADDRESS(I) := '1';
    end if;
    assert (DBL(I) = '0' or DBL(I) = '1')
      report "Invalid Memory Address";
  end loop;
  return ADDRESS;
end DWORD_to_MADD;


-- JOIN_DWORDS
-- Purpose: joins "hi" WORD (1st DWORD) with "lo"
--          WORD (2nd DWORD) for return DWORD

function JOIN_DWORDS(HI, LO : DWORD) return DWORD is
  variable JOINED : DWORD;
begin
  for I in 31 downto HI'LENGTH/2 loop
    if HI(I - (HI'LENGTH/2)) = '0' then
      JOINED(I) := '0';
    elsif HI(I - (HI'LENGTH/2)) = '1' then
      JOINED(I) := '1';
    else
      JOINED(I) := '1';
    end if;
    assert (HI(I - (HI'LENGTH/2)) = '0' or HI(I - (HI'LENGTH/2)) = '1')
      report "Invalid DWORD splicing";
  end loop;

  for I in (LO'LENGTH/2 - 1) downto 0 loop
    if LO(I) = '0' then
      JOINED(I) := '0';
    elsif LO(I) = '1' then
      JOINED(I) := '1';
    else
      JOINED(I) := '1';
    end if;
    assert (LO(I) = '0' or LO(I) = '1')
      report "Invalid DWORD splicing";
  end loop;

  return JOINED;
end JOIN_DWORDS;


-- MAP_FIELDS
-- Purpose: maps valid fields of (2) input DWORDs to
--          construct return DWORD

function MAP_FIELDS(TO_LP, FROM_LP : DWORD) return DWORD is
  variable COMPACT : DWORD;
begin
  for I in 31 downto 13 loop
    COMPACT(I) := '0';
  end loop;

  for I in 12 downto 8 loop
    if TO_LP(I - 8) = '0' then
      COMPACT(I) := '0';
    elsif TO_LP(I - 8) = '1' then
      COMPACT(I) := '1';
    else
      COMPACT(I) := '1';
    end if;
    assert (TO_LP(I - 8) = '0' or TO_LP(I - 8) = '1')
      report "Invalid Event Address Mapping";
  end loop;

  for I in 7 downto 6 loop
    if FROM_LP(I + 11) = '0' then
      COMPACT(I) := '0';
    elsif FROM_LP(I + 11) = '1' then
      COMPACT(I) := '1';
    else
      COMPACT(I) := '1';
    end if;
    assert (FROM_LP(I + 11) = '0' or FROM_LP(I + 11) = '1')
      report "Invalid Event Address Mapping";
  end loop;

  for I in 4 downto 0 loop
    if FROM_LP(I) = '0' then
      COMPACT(I) := '0';
    elsif FROM_LP(I) = '1' then
      COMPACT(I) := '1';
    else
      COMPACT(I) := '1';
    end if;
    assert (FROM_LP(I) = '0' or FROM_LP(I) = '1')
      report "Invalid Event Address Mapping";
  end loop;

  return COMPACT;
end MAP_FIELDS;


-- HI_LO_ADD
-- Purpose: Add high and low WORDS of DWORD to return
--          a DWORD "SUM"

function HI_LO_ADD(DBL_WORD : DWORD) return DWORD is
  variable HI_WORD, LO_WORD, SUM : DWORD;
begin
  HI_WORD(15 downto 0)  := DBL_WORD(31 downto 16);
  LO_WORD(15 downto 0)  := DBL_WORD(15 downto 0);
  HI_WORD(31 downto 16) := "0000000000000000";
  LO_WORD(31 downto 16) := "0000000000000000";
  SUM := CHANGE(BVtoMVL7V(ItoBV(BVtoI(MVL7VtoBV(CHANGE(HI_WORD))) +
                                BVtoI(MVL7VtoBV(CHANGE(LO_WORD))))));
  return SUM;
end HI_LO_ADD;

end System;

Appendix B.  DES Coprocessor VHDL Design


This appendix contains the chip level architecture for the DES coprocessor system
as shown in Figure 4.2. The entity declaration and the components that make up the DES
coprocessor system are given. VHDL behavioral descriptions, as described in Chapters 3
and 4, for each component are included in the following appendices.

B.1  DES Coprocessor Structure

This appendix contains the architectural body of the DES coprocessor. The entity
declaration defines the DES coprocessor in terms of system input and output ports.
The components that make up the DES coprocessor system are declared and defined with
generic parameters and port descriptions. The signal mapping that connects the system
components is also given.

-- FILE: des_structure.vhd

-- AUTHOR: Paul J. Taylor
-- PURPOSE: Structural architectural body of the DES
--          coprocessor. Entity declaration defining the inputs and
--          outputs for the DES Coprocessor is given. Components and
--          signals for the DES coprocessor system are declared and
--          then instantiated.

library ZYCAD;
library DESIGN;
use ZYCAD.TYPES.all;
use ZYCAD.COMPONENTS.all;
use ZYCAD.BV_ARITHMETIC.all;
use WORK.SYSTEM.all;
use WORK.BUS_SYS.all;

entity DES_sys is
  port(
    RUN     : in    MVL7;    -- status enable/run (Vcc)
    CLK     : in    BIT;     -- System CLK2 (1/2 CPU sys)
    RESETIN : in    MVL7;    -- RESET from CPU
    WR      : in    MVL7;    -- W/R* from CPU
    NPS1    : in    MVL7;    -- M/IO* from CPU
    NPS2    : in    MVL7;    -- A16 from CPU
    CMD0    : in    MVL7;    -- A2 from CPU ('1' - data, '0' - opcode)
    INTR    : out   MVL7;    -- interrupt request to CPU
    READYO  : inout MVL7;    -- wait state cntrl sig (xfer acknowledge)
    BUSY    : out   MVL7;    -- coprocessor status signal
    ERROR   : out   MVL7;    -- coprocessor error signal
    ADD_STR : in    MVL7;    -- ADS* address valid strobe
    SYSIN   : in    DWORD;   -- CPU data bus
    SYSOUT  : out   DWORD);  -- CPU data bus
end DES_sys;

architecture CHIP_LEVEL of DES_sys is


-- DES Synchronization Coprocessor component

component DES
  generic(RDEL, WDEL, ODEL, MADEL, PER : TIME);
  port( DATAin                : in    DWORD;   -- data_bus port
        DATAout               : out   DWORD;   -- data_bus port
        MA                    : out   M_ADD;   -- ram address port
        INTR                  : out   MVL7;    -- int request to CPU
        RDP, WTP              : out   MVL7;    -- parallel i/o control
        NINTin                : in    MVL7;    -- parallel i/o request
        RDC, WTC, NE, RST     : out   MVL7;    -- CAM i/o control
        IO, RW                : out   MVL7;    -- RAM i/o control
        CLK                   : in    BIT;     -- 1/2 CPU system clock
        WR, NPS1, NPS2, CMD0  : in    MVL7;    -- CPU control input
        RESETIN, RUN          : in    MVL7;    -- CPU control input
        READYO                : inout MVL7;    -- DES status/control
        BUSY, ERROR           : out   MVL7);   -- DES status/control output
end component;


-- Memory component

component RAM_MEM
  generic (Ndata : Positive := 32;                  -- # of data lines
           Naddr : Positive := 11;                  -- # of address lines
           RDEL, DISDEL : TIME);                    -- read and disable delay
  port (DATAI : in  DWORD;                          -- data in lines
        DATAO : out DWORD;                          -- data out lines
        ADDR  : in  MVL7_VECTOR(Naddr-1 downto 0);  -- address lines
        CE    : in  MVL7;                           -- chip enable (high)
        RW    : in  MVL7);                          -- read (low) and
                                                    -- write (high)
end component;

-- Parallel I/O Latch component

component PAR
  generic(GDEL, FFDEL, BUFDEL : TIME);      -- delay times
  port( DI   : in  DWORD;                   -- data input
        DO   : out DWORD;                   -- data output
        NDS1, DS2, MD, NCLR : in MVL7;      -- I/O control
        STB  : in  MVL7;                    -- addr latch enable
        NINT : out MVL7);                   -- interrupt request
end component;

-- Content Addressable Memory (CAM) component

component C_MEM
  generic(RDEL, WDEL, DISDEL : TIME);         -- delay times
  port( DATAinto  : in  DWORD;                -- data in
        DATAoutof : out DWORD;                -- data output
        CLK       : in  BIT;                  -- DES clock
        DS1, NDS2, MODE, N_CLR : in MVL7);    -- CAM control
end component;


-- Signal declarations connecting chip_level
-- components in DES coprocessor.

signal DATA_BUS : DWORD;                            -- DES data bus
signal DATAin, DATAout : DWORD;
signal MA : M_ADD;
signal DATAI, DATAO : DWORD;
signal NINTin, IO, RW, RDP, WTP : MVL7;
signal RDC, WTC, NE, RST : MVL7;
signal ADDR : MVL7_VECTOR(10 downto 0);
signal DATAinto, DATAoutof : DWORD;
signal CE : MVL7;
signal DS1, NDS2, MODE, N_CLR : MVL7;
signal NDS1, DS2, MD, NCLR, STB, NINT : MVL7;
signal ZERO : MVL7 := '0';
signal ONE  : MVL7 := '1';


-- Component instantiations.

begin

COP: DES
  generic map(RDEL, WDEL, ODEL, MADEL, PER)
  port map(
    DATAin  => DATAin,
    DATAout => DATAout,
    MA      => MA,
    INTR    => INTR,
    IO      => IO,
    RW      => RW,
    RDP     => RDP,
    WTP     => WTP,
    NINTin  => NINTin,
    RDC     => RDC,
    WTC     => WTC,
    NE      => NE,
    RST     => RST,
    RESETIN => RESETIN,
    RUN     => RUN,
    CLK     => CLK,
    WR      => WR,
    NPS1    => NPS1,
    NPS2    => NPS2,
    CMD0    => CMD0,
    READYO  => READYO,
    BUSY    => BUSY,
    ERROR   => ERROR);


MEM: RAM_MEM
  generic map(Ndata, Naddr, RDEL, DISDEL)
  port map(
    DATAI => DATAI,
    DATAO => DATAO,
    ADDR  => ADDR,
    CE    => CE,
    RW    => RW);


PARIN: PAR
  generic map(GDEL, FFDEL, BUFDEL)     -- Logic delays
  port map( DI   => SYSIN,             -- CPU input
            DO   => DATA_BUS,          -- out to DES
            NDS1 => ZERO,              -- low device select
            DS2  => RDP,               -- variable dev. sel.
            MD   => ZERO,              -- Read Mode
            NCLR => ONE,               -- no clear
            STB  => ADD_STR,           -- CPU strobe (high)
            NINT => NINT);             -- low interrupt

PAROUT: PAR
  generic map(GDEL, FFDEL, BUFDEL)     -- Logic delays
  port map( DI   => DATA_BUS,          -- DES output
            DO   => SYSOUT,            -- out to CPU
            NDS1 => ZERO,              -- low device select
            DS2  => WTP,               -- variable dev. sel.
            MD   => ONE,               -- Write Mode
            NCLR => ONE,               -- no clear
            STB  => ZERO,              -- no strobe
            NINT => open);             -- no interrupt

CAM: C_MEM
  generic map(RDEL, WDEL, DISDEL)
  port map(
    DATAinto  => DATAinto,
    DATAoutof => DATAoutof,
    CLK       => CLK,
    DS1       => DS1,
    NDS2      => NDS2,
    MODE      => MODE,
    N_CLR     => N_CLR);


-- Signal mapping between components

DATAin   <= DATA_BUS;     -- cop input
DATA_BUS <= DATAout;      -- cop output
DATAI    <= DATA_BUS;     -- ram input
DATA_BUS <= DATAO;        -- ram output
DATAinto <= DATA_BUS;     -- cam input
DATA_BUS <= DATAoutof;    -- cam output
ADDR     <= MA;
NINTin   <= NINT;
CE       <= IO;
DS1      <= RDC;
NDS2     <= WTC;
MODE     <= NE;
N_CLR    <= RST;

end CHIP_LEVEL;

B.2  DES Coprocessor Behavior

The DES coprocessor architectural behavior is given in this appendix. Each of the
states of the DES coprocessor, shown in Figure 4.1, is implemented by a dedicated VHDL
process within this behavior. The primary DES simulation operations (i.e., initialize,
post-message, get-next-event, and post-event) are included as procedures that are called
in the execute process.

Additional procedures that implement frequently needed operations (i.e., interrupt
CPU and send null message) are also included in this behavior. The signal multiplexing
used to resolve signals driven by multiple processes is included at the end of the
behavior.
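
As a reading aid for the state transitions, the conditions tested by the STATE process in the listing can be paraphrased in a few lines of software. This is only a sketch of the selection logic on a clock edge, assuming the coprocessor is not stopped and not in an I/O wait; the function name `select_state` and the label "NOT_RUNNING" are hypothetical, and signal values are written as the characters '0' and '1' as in the VHDL.

```python
def select_state(started, nps1, nps2, wr):
    """Return the state enabled on a clock edge for the given CPU inputs.

    started : True once RUN has been asserted (the START signal).
    nps1, nps2, wr : CPU control inputs, each '0' or '1'.
    """
    if not started:
        return "NOT_RUNNING"   # hypothetical label; RUN not yet asserted
    if wr != '1':
        return "EXECUTE"       # no CPU write in progress
    if nps1 == '0' and nps2 == '1':
        return "CPU_IO"        # CPU is writing an opcode or operand
    return "START"             # selected elsewhere; stay in housekeeping


selected = select_state(True, '0', '1', '1')
```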

-- FILE: des_beh.vhd
-- AUTHOR: Paul J. Taylor
-- PURPOSE: Architectural BEHAVIOR of the DES Coprocessor.
--          Implements four (4) basic operations (Initialize,
--          Post_Message, Get_Next_Event, and Post_Event) for
--          a conservative synchronization protocol, discrete
--          event simulation.

library ZYCAD;
library DESIGN;
use ZYCAD.TYPES.all;
use ZYCAD.BV_ARITHMETIC.all;
use WORK.all;
use WORK.SYSTEM.all;
use WORK.BUS_SYS.all;

-- ENTITY declaration for DES coprocessor

entity DES is
  generic(RDEL, WDEL, ODEL, MADEL, PER : TIME);
  port(DATAin               : in    DWORD;   -- DATA_BUS PORT
       DATAout              : out   DWORD :=
           "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ"; -- DATA_BUS PORT
       MA                   : out   M_ADD;   -- RAM ADDRESS PORT
       INTR                 : out   MVL7;    -- Int request to CPU
       RDP, WTP             : out   MVL7;    -- PAR I/O CONTROL
       NINTin               : in    MVL7;    -- PAR I/O STATUS REQUEST
       RDC, WTC, NE, RST    : out   MVL7;    -- CAM I/O CONTROL & Mstr CLR
       IO, RW               : out   MVL7;    -- RAM CONTROL
       CLK                  : in    BIT;     -- 1/2 CPU SYSTEM CLOCK
       WR, NPS1, NPS2, CMD0 : in    MVL7;    -- CPU CONTROL INPUT
       RESETIN, RUN         : in    MVL7;    -- CPU CONTROL INPUT (RUN=STEN)
       READYO               : inout MVL7;    -- CPU CONTROL OUTPUT
       BUSY, ERROR          : out   MVL7);   -- CPU CONTROL OUTPUT
end DES;


-- ARCHITECTURAL BEHAVIOR of DES coprocessor

architecture BEHAVIOR of DES is
  signal STOP : BIT;
  signal BUSYB, BUSYC, BUSYS, BUSYE : MVL7;
  signal RDY, RDYC, RDYE : MVL7;
  signal BRW, RWS, RWC, RWE : MVL7;
  signal BIO, IOS, IOC, IOE : MVL7;
  signal BMA, MAS, MAC, MAE : M_ADD;
  signal RDCB, RDCS, RDCE : MVL7;
  signal WTCB, WTCS, WTCE : MVL7;
  signal NEB, NES, NEE : MVL7;
  signal CPU_IO, CPU_IOS, CPU_IOC, START, EXECUTE : BOOLEAN;
  signal DONE, DONEC, DONEE : BOOLEAN;
  signal IOWAIT, IOWAITS, IOWAITC, IOWAITE : BIT;
  signal MATCH     : BOOLEAN := false;    -- update ARCS_IN_STAT
  signal LOADED    : BOOLEAN := false;    -- GP Regs have next event
  signal HAD_EVENT : BOOLEAN := false;    -- have event for CPU
  signal SEND_NULL : BOOLEAN := false;    -- Get_Event was "null"

  -- General purpose REGISTERS are declared as signals.
  -- Receive inputs from START, CPU_IO_PROC, and EXECUTE_PROC which are mux'd
  -- and the most recent process assertion is new value of Reg_32(X).

  signal Reg_32, Reg_32S, Reg_32C, Reg_32E : REG32(1 to 10);
  signal BUFF_IO, BUFF_IOS, BUFF_IOC, BUFF_IOE : DWORD :=
         "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ";
  signal FLAGS, FLAGSS, FLAGSE : DWORD;
  signal IR, ACC : DWORD;
  signal OPCODE : MVL7_VECTOR(31 downto 0);    -- temp for OPCODE


begin

-- Run (STEN) Process (Chip select circuitry or Vcc)

RUN_PROC: process(RUN)
begin
  if RUN = '1' then
    STOP <= '0';
  else
    STOP <= '1';
  end if;
end process RUN_PROC;

-- State Process (CURRENT STATE and state transitions)

STATE: process(RUN, STOP, IOWAIT, CLK, START, CPU_IO, EXECUTE)
begin
  if (not RUN'STABLE) and (RUN = '1') then
    START <= true;
  elsif ((RUN'STABLE) and (STOP = '0') and (IOWAIT = '0')
         and (not CLK'STABLE)) then
    if (START and (NPS1 = '0') and (NPS2 = '1') and (WR = '1')) then
      CPU_IOS <= true;
    else
      CPU_IOS <= false;
    end if;
    if (START and (WR /= '1')) then
      EXECUTE <= true;
    else
      EXECUTE <= false;
    end if;
  end if;
end process STATE;

-- Start Process (Housekeeping and Operating in Idle state)
-- ****************************************************************
-- CAM overflow swapping and memory management (garbage collection).
-- Checks for CAM overflow events stored in RAM. Uses a circular queue to
-- take events (one per cycle) from head and move to CAM.
-- Check FLAGS register:
--   bit(0): '1' = CAM full; '0' = CAM not full
--   bits(28->23): Qty of events in RAM
--   bits(22->12): start addr (head) of events in RAM
--   (3)*DWORD per event in RAM

START_PROC: process

  -- variable/constants to manage circular queue of CAM overflow events
  constant First_Q_Addr : DWORD := "00000000000000000000001000011111";
  constant Last_Q_Addr  : DWORD := "00000000000000000000000110111111";
  variable Next_Addr    : DWORD := "00000000000000000000000000000000";

begin
  wait on START until START;
  BUSYS <= '0',
           '1' after ODEL;                     -- unassert "busy" signal

  -- reset/initialization
  if RESETIN = '1' then
    IOWAITS <= '1';
    FLAGSS(11 downto 0)  <= "010000111110";    -- tail of RAM queue
                                               -- and CAM status bit
    FLAGSS(22 downto 12) <= "01000011111";     -- head of RAM queue
    FLAGSS(28 downto 23) <= "000000";          -- # events in RAM
    IOWAITS <= '0';
  end if;


  -- if CAM not full and events in RAM -> move to CAM
  if (FLAGS(0) = '0' and
      BVtoI(MVL7VtoBV(CHANGE(FLAGS(28 downto 23)))) > 0) then
    IOWAITS <= '1';
    Next_Addr(10 downto 0) := FLAGS(22 downto 12);   -- head of queue

    -- loop to read RAM and load GP registers with event to move to CAM
    --   Reg_32(2) <= TO/FROM_LP ids
    --   Reg_32(3) <= event Time_Tag
    --   Reg_32(4) <= Memr_Ptr to event
    LOAD_REG_LOOP:
    for I in 2 to 4 loop                             -- 2 to 4: indx regs
      MAS <= DWORD_to_MADD(Next_Addr) after MADEL;   -- RAM address
      RWS <= '0' after ODEL;                         -- read cycle
      IOS <= '1' after ODEL;
      wait for RDEL;
      BUFF_IOS <= DATAin after FFDEL;                -- event data
      wait for FFDEL;
      RWS <= 'Z' after ODEL + GDEL;
      IOS <= '0' after ODEL;
      MAS <= "ZZZZZZZZZZZ" after ODEL;
      Reg_32S(I) <= BUFF_IOE after FFDEL;            -- load event data
      Next_Addr := CHANGE(BVtoMVL7V(ItoBV(BVtoI(MVL7VtoBV(CHANGE(Next_Addr))) + 1)));
      wait for ODEL;
    end loop LOAD_REG_LOOP;

    -- CAM write loop (event_id, event time_tag, memr_ptr to msg)
    CAM_SAVE_EVENT:
    for I in 2 to 4 loop                             -- use gp reg indx
      BUFF_IOS <= Reg_32(I) after FFDEL;             -- next event field
      wait for FFDEL;
      DATAout <= BUFF_IOE after ODEL,                -- put data on bus
                 "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ" after 2*ODEL + WDEL;
      wait for ODEL;
      NES <= '0' after ODEL;
      WTCS <= '1' after ODEL;
      wait for WDEL;
      WTCS <= '0' after ODEL;
      NES <= 'Z' after ODEL;
      wait for ODEL;
    end loop CAM_SAVE_EVENT;

    -- Update CAM_FULL bit (FLAGS(0)) with automatic reading of CAM status
    RDCS <= '1' after ODEL;                          -- CAM read
    wait for RDEL;
    BUFF_IOS <= DATAin after FFDEL;
    RDCS <= '0' after ODEL;
    wait for ODEL;
    if BUFF_IOS(0) = '1' then
      FLAGSS(0) <= '1';                              -- CAM is full
    end if;

    IOWAITS <= '0';
  end if;

end process START_PROC;

-- ****************************************************************
-- CPU_IO Process (Get instruction and data from CPU)
-- Pseudo code
--   If INPUT (WR = '1')
--     read parallel (RDP = '1') and load into BUFF_IO register
--     End bus_cycle (READYO = '0')
--     If OPCODE (A2 = '0')
--       load instruction register (IR <= BUFF_IO)
--     If OPERAND (A2 = '1')
--       load operand(s) into gen purpose register (REG_32(X) <= BUFF_IO)
--   Transition to execute state at end of last bus xfer cycle

CPU_IO_PROC: process
  variable num       : INTEGER := 1;     -- register counter
  variable Next_Addr : DWORD;            -- another GP register
begin
  wait on CPU_IO until CPU_IO;           -- CPU writes only!!

  -- reinitialize num counter if previous opcode was executed
  if DONE then
    num := 1;
    DONEC <= false;
    BUSYC <= '1' after ODEL;             -- prompt for Test_Bench
  end if;

  IOWAITC <= '1';                        -- no state changes
  RDYC <= '1' after ODEL;                -- new bus cycle
  wait until not CLK'stable;             -- synchronization

  RDP <= '1' after ODEL;                 -- read in data bus
  wait until not CLK'stable;             -- DES reads (after STRB)
  RDP <= '0' after ODEL;
  BUFF_IOC <= DATAin after FFDEL;        -- DATA_BUS into BUFF_IO
  wait for FFDEL;

  -- opcode was read from data_bus
  if (CMD0 = '0') then
    IR <= BUFF_IOC after FFDEL;          -- load inst reg w/opcode
    wait for FFDEL;
    OPCODE <= CHANGE(IR);                -- ease of reading
    BUFF_IOC <= "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ" after 1 ns;
                                         -- clears buff (resolve)

  -- operand was read from data_bus


elsil  (CMDO  =  »1')  then 


data_bus  was  operand 


    -- (4) essential operands to store in GP registers
    if (num <= 4) then
      Reg_32C(num) <= BUFF_IOC after FFDEL;   -- load essential registers
      wait for FFDEL;                         -- reg xfer delay
      BUFF_IOC <= "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ" after 1 ns;
                                              -- resolve buff
      num := num + 1;                         -- next GP register

    -- store additional operands in LP's RAM partition
    elsif (num > 4) then
      Reg_32C(8) <= BUFF_IOC after FFDEL;     -- temp for next operand
      wait for FFDEL;


      -- get pointer to LP's RAM partition (this is first add'l operand)
      if (num = 5) then
        MAC <= DWORD_to_MADD(Reg_32(1)) after MADEL;   -- LP RAM partition tbl addr
        RWC <= '0' after ODEL;                         -- RAM read cycle
        IOC <= '1' after ODEL;
        wait for RDEL;
        BUFF_IOC <= DATAin after FFDEL;                -- LP RAM base addr ptr
        wait for FFDEL;
        Reg_32C(9) <= BUFF_IOC after FFDEL;            -- for future use (Init_Sim)
        RWC <= 'Z' after ODEL + GDEL;
        IOC <= '0' after ODEL;                         -- must chg before RW
        wait for ODEL;

        -- LP's data base_addr in RAM + offset
        Next_Addr := CHANGE(BVtoMVL7V(ItoBV(BVtoI(MVL7VtoBV(CHANGE(BUFF_IOC))) + num - 1)));
        BUFF_IOC <= Reg_32(8) after FFDEL;
        wait for FFDEL;
        DATAout <= BUFF_IOC after FFDEL + ODEL,        -- put operand on data_bus
                   "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ" after 2*ODEL + WDEL;
        MAC <= DWORD_to_MADD(Next_Addr) after MADEL;   -- RAM address
        RWC <= '1' after 3*ODEL;                       -- RAM write cycle
        IOC <= '1' after ODEL;
        wait for WDEL;
        RWC <= 'Z' after ODEL + GDEL;
        IOC <= '0' after ODEL;                         -- must chg before RW
        MAC <= "ZZZZZZZZZZZ" after ODEL;
        num := num + 1;                                -- next GP register

        -- next RAM address
        Next_Addr := CHANGE(BVtoMVL7V(ItoBV(BVtoI(MVL7VtoBV(CHANGE(Next_Addr))) + 1)));
        Reg_32C(6) <= Next_Addr;                       -- needed in future (Init_Sim)


      -- already have LP's RAM address: store add'l operands in RAM
      elsif (num > 5) then
        wait for FFDEL;                                -- xfer of buff_ioc to reg_32(8)
        BUFF_IOC <= Reg_32(8) after FFDEL;
        wait for FFDEL;
        DATAout <= BUFF_IOC after FFDEL + ODEL,        -- put operand on data bus
                   "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ" after 2*ODEL + WDEL;
        MAC <= DWORD_to_MADD(Next_Addr) after MADEL;   -- RAM address
        RWC <= '1' after 3*ODEL;                       -- RAM write cycle
        IOC <= '1' after ODEL;
        wait for WDEL;
        RWC <= 'Z' after ODEL + GDEL;
        IOC <= '0' after ODEL;                         -- must chg before RW
        MAC <= "ZZZZZZZZZZZ" after ODEL;
        num := num + 1;                                -- next GP register

        -- next RAM address
        Next_Addr := CHANGE(BVtoMVL7V(ItoBV(BVtoI(MVL7VtoBV(CHANGE(Next_Addr))) + 1)));
        Reg_32C(6) <= Next_Addr;                       -- needed in future (Init_Sim)
      end if;
    end if;
  end if;


  -- toggle state for multiple CPU_IO state operations in succession
  wait until not CLK'stable;           -- synchronization
  CPU_IOC <= false;                    -- can return to cpu_io
  IOWAITC <= '0';                      -- allow state changes
  RDYC <= '0' after GDEL,
          '1' after PER/2 + GDEL;
  wait for ODEL;                       -- delay for synch

end process CPU_IO_PROC;
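
The process above advances RAM addresses with the nested conversion CHANGE(BVtoMVL7V(ItoBV(BVtoI(MVL7VtoBV(CHANGE(x))) + k))). As an illustration only, the same integer round trip can be sketched on a plain bit_vector. The package addr_math_sketch and the functions bv_to_int, int_to_bv and addr_offset are hypothetical names, not the conversion package used by this listing, and the MVL7 and CHANGE steps are deliberately left out.

```vhdl
-- Hypothetical sketch: address arithmetic by converting to integer and back.
package addr_math_sketch is
  subtype dword_bv is bit_vector(31 downto 0);
  function bv_to_int (v : bit_vector) return natural;
  function int_to_bv (n : natural) return dword_bv;
  function addr_offset (a : dword_bv; k : integer) return dword_bv;
end addr_math_sketch;

package body addr_math_sketch is

  -- Read v as an unsigned number, leftmost bit most significant.
  -- Assumes the value fits in a natural (true for the 11-bit RAM addresses).
  function bv_to_int (v : bit_vector) return natural is
    variable r : natural := 0;
  begin
    for i in v'range loop
      r := r * 2;
      if v(i) = '1' then
        r := r + 1;
      end if;
    end loop;
    return r;
  end bv_to_int;

  -- Write n into a 32-bit vector, bit 0 least significant.
  function int_to_bv (n : natural) return dword_bv is
    variable r : dword_bv := (others => '0');
    variable t : natural  := n;
  begin
    for i in 0 to 31 loop
      if t mod 2 = 1 then
        r(i) := '1';
      end if;
      t := t / 2;
    end loop;
    return r;
  end int_to_bv;

  -- Equivalent in spirit to the "+ 1" / "- 1" address updates in the listing.
  function addr_offset (a : dword_bv; k : integer) return dword_bv is
  begin
    return int_to_bv(bv_to_int(a) + k);
  end addr_offset;

end addr_math_sketch;
```

With such a helper, an update like the "next RAM address" step would read Next_Addr := addr_offset(Next_Addr, 1); a negative result would fail the natural range check in simulation, which is arguably the desired behavior for an address.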

-- Execute Process (DES operation execution)

EXECUTE_PROC: process
  variable ARCS_IN_STAT : DWORD;       -- status for LPs in
  variable Next_Addr    : DWORD;       -- next RAM addr for r/w
                                       -- same as CPU_IO_PROC

-- PROCEDURES AND FUNCTIONS required for EXECUTE process
-- INTR_CPU_SEND procedure (output to CPU with Interrupt Request)
-- Intr_CPU_Send pseudo/algorithm
--   Assert Interrupt request to CPU
--   busy wait for CPU to acknowledge request (i.e. control signals)
--   Register (DWORD) to send to CPU is put on data_bus
--   Ensure ready line "RDY" is high for CPU bus xfer cycle
--   Assert parallel I/O (write)
--   Pull ready line "RDY" low to end CPU bus xfer cycle
procedure Intr_CPU_Send (OUTPUT : in DWORD) is
  variable MSG_OUT : DWORD;
begin
  MSG_OUT := OUTPUT;                   -- output to CPU
  BUFF_IOE <= MSG_OUT;
  INTR <= '1' after ODEL;
  wait until ((NPS1 = '0') and (NPS2 = '1') and (WR = '0'));
  INTR <= '0' after ODEL;

  -- output MSG_OUT
  DATAout <= BUFF_IOE after ODEL,
             "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ" after 2*ODEL + RDEL;
  WTP <= '1' after ODEL;               -- parallel i/o write
  wait for RDEL;
  WTP <= '0' after ODEL;
  RDYE <= '0' after ODEL,              -- end bus cycle
          '1' after PER/2 + ODEL;

end  Intr_CPU_Send; 

-- SEND_CPU procedure (output to CPU without Interrupt Request)
-- Send_CPU pseudo/algorithm
--   Register (DWORD) to send to CPU is put on data_bus
--   Ensure ready line "RDY" is high for CPU bus xfer cycle
--   Assert parallel I/O (write)
--   Pull ready line "RDY" low to end CPU bus xfer cycle
procedure Send_CPU (OUTPUT : in DWORD) is
  variable MSG_OUT : DWORD;
begin
  RDYE <= '1' after ODEL;              -- bus cycle strb
  MSG_OUT := OUTPUT;                   -- output to CPU
  BUFF_IOE <= MSG_OUT;
  wait for FFDEL;

  -- output MSG_OUT
  DATAout <= BUFF_IOE after ODEL,
             "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ" after 2*ODEL + RDEL;
  WTP <= '1' after ODEL;               -- parallel write
  wait for RDEL;
  WTP <= '0' after ODEL;
  RDYE <= '0' after ODEL,              -- end bus cycle
          '1' after PER/2 + ODEL;
end Send_CPU;

-- SEND_NULL_MSG procedure (send "null_msg" to all output arcs)
-- SEND_NULL_MSG pseudo/algorithm
procedure Send_null_msg (RAM_Addr : DWORD; Outputs : INTEGER) is
  constant NULL_MSG  : DWORD := "00000000000000000000000000000000";
  variable Next_Addr : DWORD;
  variable Num_out   : INTEGER;
begin
  Next_Addr := RAM_Addr;               -- LP's ARCS_OUT
  Num_out   := Outputs;                -- # of arcs_out


  Send_Null_Loop:
  while (Num_out > 0) loop

    -- get next LP_OUT from RAM (TO_LP for "null_msg") & store (Reg_32(7))
    MAE <= DWORD_to_MADD(Next_Addr) after MADEL;   -- addr for next LP_OUT
    RWE <= '0' after ODEL;                         -- RAM read cycle
    IOE <= '1' after ODEL;
    wait for RDEL;
    BUFF_IOE <= DATAin after FFDEL;                -- next LP_OUT
    wait for FFDEL;
    Reg_32E(7) <= BUFF_IOE after FFDEL;            -- next TO_LP
    IOE <= '0' after ODEL;
    RWE <= 'Z' after ODEL + GDEL;                  -- must be after IO
    MAE <= "ZZZZZZZZZZZ" after ODEL;
    wait for ODEL;

    -- decrement Next_Addr in LP_RAM_Partition (reading top down)
    Next_Addr := CHANGE(BVtoMVL7V(ItoBV(BVtoI(MVL7VtoBV(CHANGE(Next_Addr))) - 1)));
    wait for ALUDEL;


    -- output TO_LP for "null_msg" to CPU
    RDYE <= '1' after ODEL;
    wait for ODEL;
    Intr_CPU_Send(Reg_32(7));          -- out_node | out_LP
    wait until READYO = '0';

    -- output FROM_LP for "null_msg" to CPU
    RDYE <= '1' after ODEL;
    wait for ODEL;
    Intr_CPU_Send(Reg_32(1));          -- from_LP (this LP)
    wait until READYO = '0';

    -- output SAFE_LOOKAHEAD_TIME for "null_msg" to CPU
    RDYE <= '1' after ODEL;
    wait for ODEL;
    Intr_CPU_Send(Reg_32(8));
    wait until READYO = '0';

    -- output NULL_MSG (null value) to CPU
    RDYE <= '1' after ODEL;
    wait for ODEL;
    Intr_CPU_Send(NULL_MSG);
    wait until READYO = '0';

    Num_out := Num_out - 1;            -- next output LP
  end loop Send_Null_Loop;

end  Send_null_msg; 
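
Each pass of Send_Null_Loop hands the CPU four DWORDs in a fixed order: TO_LP, FROM_LP, the safe lookahead time, and an all-zero null value. A minimal sketch of that message layout, assuming a plain bit_vector word, is given below. The package null_msg_sketch, the type null_msg_t and the function make_null_msg are hypothetical and only restate the ordering visible in the procedure.

```vhdl
-- Hypothetical sketch: the four-word null message sent per output arc.
package null_msg_sketch is
  subtype dword_bv is bit_vector(31 downto 0);
  type null_msg_t is array (0 to 3) of dword_bv;
  function make_null_msg (to_lp, from_lp, safe_time : dword_bv)
    return null_msg_t;
end null_msg_sketch;

package body null_msg_sketch is
  function make_null_msg (to_lp, from_lp, safe_time : dword_bv)
    return null_msg_t is
    variable m : null_msg_t;
  begin
    m(0) := to_lp;              -- destination: out_node | out_LP
    m(1) := from_lp;            -- sender: this LP
    m(2) := safe_time;          -- safe lookahead time
    m(3) := (others => '0');    -- all-zero word marks the message as null
    return m;
  end make_null_msg;
end null_msg_sketch;
```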

-- INIT_SIM procedure (initialize simulation)
-- Init_Sim pseudo code
--   GP registers loaded during CPU_IO_PROC
--     REG_32-1 <= TO_LP                      -- initialize "this" LP
--     REG_32-2 <= TO_LP(DELAY)               -- this LP's delay
--     REG_32-3 <= # ARCS_IN
--     REG_32-4 <= # ARCS_OUT
--     RAM_XXXX <= IN_1_NODE# | IN_1_LP#      -- (2) data in (1) reg, add'l arcs_in
--     RAM_XXXX <= OUT_1_NODE# | OUT_1_LP#    -- (2) data in (1) reg, add'l arcs_out
--   assert clear/reset for CAM storage       -- assert m_clr
--   reset and store sim_clock for lp_id      -- DWORD, local sim clock
--   REG_32-5 reserved for Local_Sim_Time
--   calculate min_safe_time (sim_clock + LP_delay)
--                                            -- load Safe_Time, delay to next out (dynamic)
--   send null_msg to all ARCS_OUT LPs        -- xmit bus cycles
--   REG_32-6 <= Next_Addr (next free @ top of LP RAM partition)
--   store LP significant data (regs) in LP_RAM_Partition (Reg_32(1-4))

procedure Init_Sim is                  -- init LPs for sim
  variable NUM_OUT : INTEGER;
begin
  NUM_OUT := BVtoI(MVL7VtoBV(CHANGE(Reg_32(4))));   -- # ARCS_OUT (loop indx)
  RST <= '0' after ODEL,               -- mstr_clr for CAM
         '1' after PER/2 + ODEL;
  Next_Addr := Reg_32(6);              -- saved in CPU_IO_PROC

  -- initialize the FLAGS reg
  FLAGSE(11 downto 0)  <= "010000111110";   -- tail of RAM queue and CAM "full" stat
  FLAGSE(22 downto 12) <= "01000011111";    -- head of RAM queue
  FLAGSE(28 downto 23) <= "000000";         -- # events in RAM

  -- reset LP's simulation clock (i.e. start)
  Reg_32E(5) <= "00000000000000000000000000000000" after FFDEL;
  wait for FFDEL;

  -- add LP delay to sim_clk for safe_time
  ACC <= CHANGE(BVtoMVL7V(ItoBV(BVtoI(MVL7VtoBV(CHANGE(Reg_32E(5)))) +
                                BVtoI(MVL7VtoBV(CHANGE(Reg_32(2)))))))
         after ALUDEL;
  wait for ALUDEL;                     -- PER/2
  Reg_32E(8) <= ACC after FFDEL;       -- safe_time reg

  -- top entry in LP RAM
  Next_Addr := CHANGE(BVtoMVL7V(ItoBV(BVtoI(MVL7VtoBV(CHANGE(Next_Addr))) - 1)));
  wait for FFDEL;


  -- LOOP for "null_msg" to all ARCS_OUT
  IOWAITE <= '1';                      -- no state changes
  BUSYE <= '1' after ODEL;             -- prompt for Test_Bench
  Send_null_msg(Next_Addr, NUM_OUT);
  RDYE <= '1' after ODEL;              -- ready for next proc

  -- store essential LP data (4 registers) for future operations
  --   use LP_RAM_Partition starting address (base_addr) -> done earlier
  --   align data in GP registers for RAM write LOOP
  --   NOTE: Reg_32(2) already contains LP_Delay
  --         Reg_32(3) : combine # ARCS_IN & # ARCS_OUT in single DWORD
  Next_Addr := Reg_32(9);              -- LP_RAM_Partition (base_addr)
  Reg_32E(1) <= "00000000000000000000000000000000" after FFDEL;
                                       -- init ARCS_IN_STATUS
  Reg_32E(3) <= JOIN_DWORDS(Reg_32(3), Reg_32(4)) after ALUDEL;
  wait for ALUDEL;
  Reg_32E(4) <= Reg_32(5) after FFDEL; -- LP_Simulation_Time


  -- Loop to write registers to LP_RAM_Partition
  IOWAITE <= '1';                      -- no state changes
  SAVE_LP_DATA:
  for I in 1 to 4 loop
    BUFF_IOE <= Reg_32(I) after FFDEL;             -- next reg to store
    wait for FFDEL;
    DATAout <= BUFF_IOE after FFDEL + ODEL,        -- put data on bus
               "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ" after 2*ODEL + WDEL;
    MAE <= DWORD_to_MADD(Next_Addr) after MADEL;   -- RAM address
    RWE <= '1' after 3*ODEL;                       -- RAM write cycle
    IOE <= '1' after ODEL;
    wait for WDEL;
    IOE <= '0' after ODEL;
    RWE <= 'Z' after ODEL + GDEL;
    MAE <= "ZZZZZZZZZZZ" after ODEL;
    wait for ODEL;                                 -- ensure valid M_addr
    Next_Addr := CHANGE(BVtoMVL7V(ItoBV(BVtoI(MVL7VtoBV(CHANGE(Next_Addr))) + 1)));
  end loop SAVE_LP_DATA;

  IOWAITE <= '0';                      -- allow state chgs
  BUSYE <= '0' after ODEL;             -- prompt for Test_Bench


end  Init_Sim; 
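
Init_Sim packs the two arc counts into one DWORD with JOIN_DWORDS, and Get_Event later reads # ARCS_IN from bits 31 downto 16 and # ARCS_OUT from bits 15 downto 0. The body of JOIN_DWORDS is not part of this listing, so the following is only a sketch of a function with that effect, written for a plain bit_vector. The names join_sketch and join_dwords_bv are hypothetical.

```vhdl
-- Hypothetical sketch: combine two counts into the high and low words of a DWORD.
package join_sketch is
  subtype dword_bv is bit_vector(31 downto 0);
  function join_dwords_bv (arcs_in, arcs_out : dword_bv) return dword_bv;
end join_sketch;

package body join_sketch is
  function join_dwords_bv (arcs_in, arcs_out : dword_bv) return dword_bv is
    variable r : dword_bv;
  begin
    -- Slice assignments are used instead of "&" so that the result bounds
    -- are explicit (31 downto 0).
    r(31 downto 16) := arcs_in(15 downto 0);    -- HI word: # ARCS_IN
    r(15 downto 0)  := arcs_out(15 downto 0);   -- LO word: # ARCS_OUT
    return r;
  end join_dwords_bv;
end join_sketch;
```

Counts larger than 16 bits would be silently truncated by this sketch; with at most 20 LPs per node, as stated in the Post_Msg comments, that limit is not reached.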

-- POST_MSG procedure (process received event/null messages)
-- Post_Msg pseudo code
--   CPU_IO_PROC has loaded the following registers
--     REG_32-1 <= TO_LP                      -- rcvd @ this LP
--     REG_32-2 <= IN_NODE # | IN_LP #        -- FROM_LP
--     REG_32-3 <= EVENT TIME_TAG
--     REG_32-4 <= MEMR_PTR to msg            -- null_msg: ptr = 0
--   load GP registers with TO_LP essential data from RAM  -- read (4*DWORD)
--   update ARCS_IN status (toggle FROM_LP bit)
--   check CAM_FULL bit in FLAGS register
--     if not FULL -> store message & update FLAGS reg
--       (TO_LP, FROM_LP, TIME_TAG, MEMR_PTR)
--       update CAM_FULL bit in FLAGS reg by receiving CAM status msg
--     else -> store in RAM temporarily -> START/IDLE will handle


procedure Post_Msg is                  -- rcv EVENT/NULL msg
  constant Last_Q_Addr : DWORD   := "00000000000000000000000110111111";
  variable STAT_BIT    : INTEGER := 0; -- array index bit for ARCS_IN_STAT
begin

  -- load LP_GP_working registers with essential data from RAM Partition
  -- must get pointer to LP_RAM_Partition base_addr first
  IOWAITE <= '1';                                -- no state changes
  BUSYE <= '1' after ODEL;                       -- prompt for Test_Bench
  MAE <= DWORD_to_MADD(Reg_32(1)) after MADEL;   -- part tbl ptr to base addr
  RWE <= '0' after ODEL;                         -- RAM read cycle
  IOE <= '1' after ODEL;
  wait for RDEL;
  BUFF_IOE <= DATAin after FFDEL;                -- LP RAM base_addr
  wait for FFDEL;
  RWE <= 'Z' after ODEL + GDEL;                  -- after IOE
  IOE <= '0' after ODEL;
  MAE <= "ZZZZZZZZZZZ" after ODEL;
  Reg_32E(5) <= BUFF_IOE after FFDEL;            -- temp for base_addr
  Next_Addr := BUFF_IOE;                         -- increment in loop
  wait for ODEL;


  -- loop to read RAM and load LP_GP registers
  --   Reg_32(6) <= ARCS_IN_STAT    Reg_32(8) <= # ARCS_IN | # ARCS_OUT
  --   Reg_32(7) <= LP_DELAY        Reg_32(9) <= LP_Simulation_Time
  LOAD_REG_LOOP:
  for I in 6 to 9 loop                             -- 6 to 9: indx regs
    MAE <= DWORD_to_MADD(Next_Addr) after MADEL;   -- RAM address
    RWE <= '0' after ODEL;                         -- RAM read cycle
    IOE <= '1' after ODEL;
    wait for RDEL;
    BUFF_IOE <= DATAin after FFDEL;                -- LP data
    wait for FFDEL;
    RWE <= 'Z' after ODEL + GDEL;
    IOE <= '0' after ODEL;
    MAE <= "ZZZZZZZZZZZ" after ODEL;
    Reg_32E(I) <= BUFF_IOE after FFDEL;            -- load LP data
    Next_Addr := CHANGE(BVtoMVL7V(ItoBV(BVtoI(MVL7VtoBV(CHANGE(Next_Addr))) + 1)));
    wait for ODEL;
  end loop LOAD_REG_LOOP;


  -- update ARCS_IN_STATUS (i.e. set FROM_LP bits (9->0) high)
  --   Read ARCS_IN addresses (loop 0 to (# ARCS_IN - 1) times)
  --     base_addr + offset (4 ess. data) in RAM Partition to LP_IN
  --   Compare FROM_LP address to LP_IN addresses (same loop iteration)
  --     find match then EXIT
  --   Set ARCS_IN_STAT bit (bit # to set = loop iteration #)

  -- start of ARCS_IN addresses in LP_RAM_Partition
  Next_Addr := CHANGE(BVtoMVL7V(ItoBV(BVtoI(MVL7VtoBV(CHANGE(Reg_32(5)))) + 4)));

  ARCS_Loop:
  loop
    MAE <= DWORD_to_MADD(Next_Addr) after MADEL;   -- RAM address
    RWE <= '0' after ODEL;                         -- RAM read cycle
    IOE <= '1' after ODEL;
    wait for RDEL;
    BUFF_IOE <= DATAin after FFDEL;                -- IN LP id
    wait for FFDEL;
    RWE <= 'Z' after ODEL + GDEL;
    IOE <= '0' after ODEL;
    MAE <= "ZZZZZZZZZZZ" after ODEL;
    wait for ODEL;
    if (Reg_32(2) = BUFF_IOE) then                 -- find FROM_LP in RAM
      MATCH <= true;                               -- exit: IN_LP=FROM_LP
      exit;
    end if;
    Next_Addr := CHANGE(BVtoMVL7V(ItoBV(BVtoI(MVL7VtoBV(CHANGE(Next_Addr))) + 1)));
    STAT_BIT := STAT_BIT + 1;                      -- ARCS_IN_STAT(X)
  end loop ARCS_Loop;

  wait on MATCH until MATCH;
  Reg_32E(6)(STAT_BIT) <= '1';                     -- Set STATUS_BIT
  MATCH <= false;                                  -- reset MATCH signal
  wait for FFDEL;

  -- Update ARCS_IN_STATUS stored in LP's RAM partition base_addr
  DATAout <= Reg_32(6) after FFDEL + ODEL,
             "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ" after 2*ODEL + WDEL;
  MAE <= DWORD_to_MADD(Reg_32(5)) after MADEL;     -- base_addr
  RWE <= '1' after 3*ODEL;                         -- RAM write cycle
  IOE <= '1' after ODEL;
  wait for WDEL;
  RWE <= 'Z' after ODEL + GDEL;
  IOE <= '0' after ODEL;
  MAE <= "ZZZZZZZZZZZ" after ODEL;
  wait for ODEL;


  -- post message -> store event/null in CAM
  --   CAM event has (5) fields |TO_LP|FROM_NODE_#|FROM_LP_#|TIME_TAG|MEM_PTR|
  --   filter first (3) fields for compact representation
  --     TO_LP       (32 bits) -> 5 bits (20 LPs/node max)
  --     FROM_NODE_# (16 bits) -> 3 bits (8 nodes max in cube)
  --     FROM_LP_#   (16 bits) -> 5 bits (20 LPs/node max)
  --   (3) CAM write cycles: (13 'valid' bits), (32 bits), (32 bits)
  -- !!!! IF CAM IS NOT FULL -> STORE EVENT !!!!
  -- else store temporarily in RAM and let START/IDLE handle
  Reg_32E(2) <= MAP_FIELDS(Reg_32(1), Reg_32(2));  -- compact event id
  wait for ALUDEL;

  -- CAM write loop (event_id, event time_tag, memr_ptr to msg)


  if FLAGS(0) = '0' then                           -- CAM is NOT full
    CAM_SAVE_EVENT:
    for I in 2 to 4 loop                           -- use GP reg indx
      BUFF_IOE <= Reg_32(I) after FFDEL;           -- next event field
      wait for FFDEL;
      DATAout <= BUFF_IOE after ODEL,              -- put data on bus
                 "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ" after 2*ODEL + WDEL;
      wait for ODEL;
      NEE <= '0' after ODEL;
      WTCE <= '1' after ODEL;
      wait for WDEL;
      WTCE <= '0' after ODEL;
      NEE <= 'Z' after ODEL;
      wait for ODEL;
    end loop CAM_SAVE_EVENT;

    -- Update CAM_FULL bit (FLAGS(0)) with automatic reading of CAM status
    RDCE <= '1' after ODEL;                        -- CAM read
    wait for RDEL;
    BUFF_IOE <= DATAin after FFDEL;
    RDCE <= '0' after ODEL;
    wait for ODEL;
    if BUFF_IOE(0) = '1' then                      -- CAM is full
      FLAGSE(0) <= '1';
    end if;                                        -- else NOT full

  elsif FLAGS(0) = '1' then                        -- CAM is full

    -- Store Post_Msg in RAM (temporarily): START/IDLE will get out
    -- Tail of RAM queue is in FLAGS(11 downto 1)
    Next_Addr(10 downto 0) := FLAGS(11 downto 1);  -- Tail of RAM queue

    RAM_SAVE_EVENT:                                -- store in RAM
    for I in 2 to 4 loop                           -- use GP reg indx
      BUFF_IOE <= Reg_32(I) after FFDEL;           -- next event field
      wait for FFDEL;
      DATAout <= BUFF_IOE after FFDEL + ODEL,      -- put data on bus
                 "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ" after 2*ODEL + WDEL;
      MAE <= DWORD_to_MADD(Next_Addr) after MADEL; -- RAM address
      RWE <= '1' after 3*ODEL;                     -- RAM write cycle
      IOE <= '1' after ODEL;
      wait for WDEL;
      RWE <= 'Z' after ODEL + GDEL;
      IOE <= '0' after ODEL;
      MAE <= "ZZZZZZZZZZZ" after ODEL;
      Next_Addr := CHANGE(BVtoMVL7V(ItoBV(BVtoI(MVL7VtoBV(CHANGE(Next_Addr))) - 1)));
      wait for ODEL;
    end loop RAM_SAVE_EVENT;


    -- Update FLAGS to show events stored in RAM
    FLAGSE(23) <= '1';

    -- Update RAM queue tail address in FLAGS
    if Next_Addr = Last_Q_Addr then                      -- circular Q
      FLAGSE(11 downto 1) <= "01000011111";              -- back to top of Q
    else
      FLAGSE(11 downto 1) <= Next_Addr(10 downto 0);     -- incremented by 1
    end if;
  end if;

  IOWAITE <= '0';                      -- allow state change
  BUSYE <= '0' after ODEL;             -- prompt for Test_Bench
end Post_Msg;

-- GET_EVENT procedure (get NEXT_EVENT for specified LP)
-- Get_Event pseudo code
--   lp_id (TO_LP stored in Reg_32(1) from cpu_io_proc)
--   if ARCS_IN for lp_id satisfied (msg on all ARCS_IN)
--     assert rdc (read cam - next lp_id event)
--       - update ARCS_IN register
--       - update lp_id sim_clk (sim_clk + time_tag)
--       - read event memr_ptr
--       - update CAM (reset available bit)
--     send NEXT_EVENT to CPU                      -- xmit bus cycles
--     if NEXT_EVENT = NULL
--       send null with updated sim_time to all output arcs
--       do Get_Event again
--   else
--     send wait signal to CPU (no next_event)     -- xmit bus cycles
procedure Get_Event is                 -- get NEXT_EVENT from CAM
  constant WAIT_MSG    : SYS_BUS := "11110000111100001111000011110000";
  variable NUMARCS_IN  : INTEGER;
  variable NUMARCS_OUT : INTEGER;
  variable HAVE_EVENT  : BOOLEAN := false;   -- consider FLAGS bit (31-29)
  variable HAD_NULL    : BOOLEAN := false;   -- consider FLAGS bit (31-29)
  variable RAM_ADDR    : DWORD;              -- start of arcs_out for null
  variable STAT_BIT    : INTEGER := 0;       -- ARCS_IN_STAT(#)
begin


  -- THIS IS WHERE WE ADVANCE THE LOCAL SIMULATION CLOCK!!!
  -- CONTINGENT UPON HAVING A NEXT EVENT/NULL MESSAGE!!!

  -- load LP_GP_working registers with essential data from RAM Partition
  -- must get pointer to LP_RAM_Partition first
  IOWAITE <= '1';                                -- no state changes
  BUSYE <= '1' after ODEL;
  MAE <= DWORD_to_MADD(Reg_32(1)) after MADEL;   -- part tbl ptr to base addr
  RWE <= '0' after ODEL;                         -- RAM read cycle
  IOE <= '1' after ODEL;
  wait for RDEL;
  BUFF_IOE <= DATAin after FFDEL;                -- LP RAM base_addr
  wait for FFDEL;
  RWE <= 'Z' after ODEL + GDEL;
  IOE <= '0' after ODEL;
  MAE <= "ZZZZZZZZZZZ" after ODEL;
  Reg_32E(2) <= BUFF_IOE;                        -- temp for base_addr
  Next_Addr := BUFF_IOE;                         -- increment in loop


  -- loop to read RAM and load LP_GP registers
  --   Reg_32(3) <= ARCS_IN_STAT    Reg_32(5) <= # ARCS_IN | # ARCS_OUT
  --   Reg_32(4) <= LP_DELAY        Reg_32(6) <= LP_Simulation_Time
  LOAD_REG_LOOP:
  for I in 3 to 6 loop                             -- 3 to 6: indx regs
    MAE <= DWORD_to_MADD(Next_Addr) after MADEL;   -- RAM address
    RWE <= '0' after ODEL;                         -- RAM read cycle
    IOE <= '1' after ODEL;
    wait for RDEL;
    BUFF_IOE <= DATAin after FFDEL;                -- LP data
    wait for FFDEL;
    RWE <= 'Z' after ODEL + GDEL;
    IOE <= '0' after ODEL;
    MAE <= "ZZZZZZZZZZZ" after ODEL;
    Reg_32E(I) <= BUFF_IOE after FFDEL;            -- load LP data
    Next_Addr := CHANGE(BVtoMVL7V(ItoBV(BVtoI(MVL7VtoBV(CHANGE(Next_Addr))) + 1)));

    -- determine number of input ARCS
    if I = 5 then
      wait for 2*FFDEL;                            -- load Reg_32(5)
      NUMARCS_IN := BVtoI(MVL7VtoBV(CHANGE(Reg_32(I)(31 downto 16))));
    end if;
    if I = 6 then
      LOADED <= true;                              -- GP Regs loaded
    end if;
  end loop LOAD_REG_LOOP;


  -- check for ARCS_IN satisfied (logical AND of ARCS_IN_STAT input bits)
  wait on LOADED until LOADED;         -- GP Regs loaded
  LOADED <= false;                     -- reset flag
  for I in 0 to (NUMARCS_IN - 1) loop
    if Reg_32(3)(I) = '0' then
      HAVE_EVENT := false;             -- NO next_event
      exit;
    elsif Reg_32(3)(I) = '1' then
      HAVE_EVENT := true;              -- still checking
    end if;
    assert (Reg_32(3)(I) = '0' or Reg_32(3)(I) = '1')
      report "Invalid ARCS_IN_STAT";
  end loop;
  wait for ALUDEL;                     -- do bit check above


  -- either tell CPU to "wait -> no next_event" or get next_event from CAM
  -- tell CAM "who" needs Next_Event
  if HAVE_EVENT then
    BUFF_IOE <= Reg_32(1) after FFDEL;             -- CAM for next_event
    wait for FFDEL;
    DATAout <= BUFF_IOE after ODEL,                -- tell CAM for "who"
               "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ" after 2*ODEL + WDEL;
    wait for ODEL;
    NEE <= '1' after ODEL;                         -- init Next_Event
    WTCE <= '1' after ODEL;                        -- CAM write cycle
    wait for WDEL;
    WTCE <= '0' after ODEL;
    NEE <= 'Z' after ODEL;

    -- read Next_Event from CAM
    READ_CAM_LOOP:
    for I in 7 to 9 loop                           -- read 3*DWORD
      RDCE <= '1' after ODEL;                      -- CAM read cycle
      wait for RDEL;
      BUFF_IOE <= DATAin after FFDEL;
      wait for FFDEL;
      RDCE <= '0' after ODEL;
      Reg_32E(I) <= BUFF_IOE after FFDEL;          -- Reg_32(Indx)
      wait for ODEL;
    end loop READ_CAM_LOOP;

    LOADED <= true;
    HAD_EVENT <= true;                             -- have next event

  -- tell CPU to wait
  else
    Send_CPU(WAIT_MSG);                            -- output WAIT_MSG
  end if;

  -- REMAINING CODE EXECUTED ONLY WHEN CAM HAS NEXT_EVENT (including "nulls")

  -- update ARCS_IN_STAT -> conditional on ADD'L_LP_EVENT_PENDING bit
  --   bit 31 of first DWORD returned from CAM: IF = '1', add'l events pending
  --   and no update of ARCS_IN_STAT is required
  -- IF UPDATE: must determine which INPUT_LP provided next_event
  --   read LP_IDs of INPUT_ARCS from LP_RAM_Partition and compare with
  --   the FROM_LP field of next_event from CAM
  wait on HAD_EVENT until HAD_EVENT;
  if (not LOADED'stable and LOADED) then
    LOADED <= false;                   -- reset flag
    HAD_EVENT <= false;                -- reset flag
    FLAGSE(0) <= '0';                  -- got event -> not FULL


    -- unmask TO & FROM_LP id from CAM next_event (Reg_32(7)(12 downto 0))
    for I in 0 to 4 loop
      Reg_32E(10)(I) <= Reg_32(7)(I);              -- FROM_LP #
    end loop;
    for I in 6 to 15 loop
      Reg_32E(10)(I) <= '0';                       -- rest of FROM_LP #
    end loop;
    for I in 5 to 7 loop
      Reg_32E(10)(I+11) <= Reg_32(7)(I);           -- FROM_LP node #
    end loop;
    for I in 19 to 31 loop
      Reg_32E(10)(I) <= '0';                       -- rest of FROM_LP node #
    end loop;
    wait for ALUDEL;                               -- above bit twiddle

    -- TO_LP
    for I in 0 to 4 loop
      Reg_32E(7)(I) <= Reg_32(7)(I+8);
    end loop;
    for I in 5 to 30 loop
      Reg_32E(7)(I) <= '0';                        -- rest of TO_LP
    end loop;
    Reg_32E(7)(31) <= Reg_32(7)(31);               -- 31 is stat bit
    wait for ALUDEL;                               -- above bit twiddle

    if (Reg_32(7)(31) = '0') then                  -- DO UPDATE


      -- start of ARCS_IN ids in LP_RAM_Partition
      Next_Addr := CHANGE(BVtoMVL7V(ItoBV(BVtoI(MVL7VtoBV(CHANGE(Reg_32(2)))) + 4)));

      ARCS_Loop:
      loop
        MAE <= DWORD_to_MADD(Next_Addr) after MADEL;   -- RAM address
        RWE <= '0' after ODEL;                         -- read cycle
        IOE <= '1' after ODEL;
        wait for RDEL;
        BUFF_IOE <= DATAin after FFDEL;                -- in LP id
        wait for FFDEL;
        RWE <= 'Z' after ODEL + GDEL;
        IOE <= '0' after ODEL;
        MAE <= "ZZZZZZZZZZZ" after ODEL;
        if (Reg_32(10) = BUFF_IOE) then                -- IN_LP=FROM_LP?
          MATCH <= true;
          exit;
        end if;
        Next_Addr := CHANGE(BVtoMVL7V(ItoBV(BVtoI(MVL7VtoBV(CHANGE(Next_Addr))) + 1)));
        STAT_BIT := STAT_BIT + 1;                      -- ARCS_IN_STAT(#)
        wait for ODEL;
      end loop ARCS_Loop;

      wait on MATCH until MATCH;
      Reg_32E(3)(STAT_BIT) <= '0';                     -- Set STATUS_BIT
      MATCH <= false;                                  -- reset flag
    end if;                            -- if (Reg_32(7)(31) = '0') -- DO UPDATE


    -- update LP_Sim_Time -> jump local clock to next_event scheduled time
    Reg_32E(6) <= Reg_32(8) after FFDEL;           -- update LP_SIM Time
    wait for FFDEL;


    -- check next_event for "null": IF NULL -> do Get_Event again
    --   Also: build "null_msg" for later transmission to arcs_out
    --     safe_look_ahead = Reg_32(8) = sim_time + lp_delay
    --     number of output arcs = Reg_32(5)(15 downto 0)
    --     ARCS_OUT in RAM = base_addr + (sum arcs in/out) + 3

    -- Next_Event = NULL
    if (BVtoI(MVL7VtoBV(CHANGE(Reg_32(9)))) = 0) then
      HAD_NULL := true;                            -- send null_msg later
      ACC <= CHANGE(BVtoMVL7V(ItoBV(BVtoI(MVL7VtoBV(CHANGE(Reg_32(6)))) +
                                    BVtoI(MVL7VtoBV(CHANGE(Reg_32(4)))))))
             after ALUDEL;
      wait for ALUDEL;
      Reg_32E(8) <= ACC after FFDEL;               -- null safe_time
      NUMARCS_OUT := BVtoI(MVL7VtoBV(CHANGE(Reg_32(5)(15 downto 0))));
      wait for ALUDEL;
      ACC <= CHANGE(BVtoMVL7V(ItoBV(BVtoI(MVL7VtoBV(CHANGE(Reg_32(2)))) +
                                    BVtoI(MVL7VtoBV(CHANGE(Reg_32(5)))) +
                                    3))) after ALUDEL;
      wait for ALUDEL;
      RAM_ADDR := ACC;                             -- ARCS_OUT in LP RAM
      if (Reg_32(7)(31) = '1') then                -- multiple events
        Get_Event;                                 -- try again
      else
        Send_CPU(WAIT_MSG);
        wait until READYO = '0';
        -- must send null MESSAGES OUT TOO!!!
        --   using RAM part base_addr
        --   get number of ARCS_OUT from RAM
        --   loop through ARCS_OUT and send "null_msg"
        --   end loop SEND_NULLS;

        -- During subsequent "Get_Event" in above "if" we retrieved a "real"
        -- next_event -> we currently update the sim_time and pass the "real"
        -- event to the CPU. Must deal with the fact that a "null" was processed
        -- first -> i.e., must send "nulls" to all output arcs to update
        -- LP safetimes for the rest of the simulation.
      end if;

    elsif (BVtoI(MVL7VtoBV(CHANGE(Reg_32(9)))) > 0) then   -- send CPU Next_Event

      -- TO_LP for Next_Event
      Reg_32E(7)(31) <= '0' after FFDEL;           -- clear multiple bit
      BUFF_IOE <= "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ" after FFDEL;
      wait for ODEL;
      Send_CPU(Reg_32(7));                         -- send TO_LP
      wait until READYO = '0';

      -- FROM_LP for Next_Event
      BUFF_IOE <= "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ" after FFDEL;
      wait for ODEL;
      Send_CPU(Reg_32(10));                        -- load FROM_LP

      -- Memr_PTR for Next_Event
      wait until READYO = '0';
      wait for ODEL;
      BUFF_IOE <= "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ" after FFDEL;
      Send_CPU(Reg_32(9));                         -- load MEM_PTR
      wait until READYO = '0';
    end if;                                        -- if Next_Event = NULL


    -- UPDATE LP's RAM Partition with new ARCS_IN_STATUS and new SIM_TIME
    --   currently in Reg_32(3) and Reg_32(6) respectively & base_addr = Reg_32(2)
    Next_Addr := Reg_32(2);                        -- RAM Part base_addr
    wait for FFDEL;

    RAM_UPDATE:                                    -- update RAM partition
    for I in 1 to 2 loop                           -- use GP reg indx
      BUFF_IOE <= Reg_32(I*3) after FFDEL;         -- next event field
      wait for FFDEL;
      DATAout <= BUFF_IOE after FFDEL + ODEL,      -- put data on bus
                 "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ" after 2*ODEL + WDEL;
      MAE <= DWORD_to_MADD(Next_Addr) after MADEL; -- RAM address
      RWE <= '1' after 3*ODEL;                     -- RAM write cycle
      IOE <= '1' after ODEL;
      wait for WDEL;
      RWE <= 'Z' after ODEL + GDEL;
      IOE <= '0' after ODEL;
      MAE <= "ZZZZZZZZZZZ" after ODEL;
      wait for ODEL;

      -- offset address by '3' for SIM_TIME location in LP's RAM partition
      Next_Addr := CHANGE(BVtoMVL7V(ItoBV(BVtoI(MVL7VtoBV(CHANGE(Next_Addr))) + 3)));
      if HAD_NULL then
        HAD_NULL := false;
        SEND_NULL <= true after WDEL + FFDEL;      -- finish RAM_UPDATE
      end if;
    end loop RAM_UPDATE;

    -- if Get_Event was "null" we now send null_msg to all arcs_out
    if (not SEND_NULL'stable and SEND_NULL) then
      Send_null_msg(RAM_ADDR, NUMARCS_OUT);
    end if;

  end if;                                          -- if HAVE_EVENT
  IOWAITE <= '0';
  BUSYE <= '0' after ODEL;
end Get_Event;
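
The unmask loops in Get_Event imply a 13-bit compact event id: FROM_LP # in bits 4 downto 0, FROM node # in bits 7 downto 5, and TO_LP in bits 12 downto 8, with bit 31 carrying the "additional events pending" status. The sketch below restates that layout on a plain bit_vector. The packing direction is an assumption inferred from the unmask code, since the body of MAP_FIELDS does not appear in this listing, and the names event_id_sketch, pack_event_id, unpack_from_lp and unpack_to_lp are hypothetical.

```vhdl
-- Hypothetical sketch: compact event id packing and the matching unmask step.
package event_id_sketch is
  subtype dword_bv is bit_vector(31 downto 0);
  function pack_event_id (to_lp, from_lp : dword_bv) return dword_bv;
  function unpack_from_lp (id : dword_bv) return dword_bv;
  function unpack_to_lp (id : dword_bv) return dword_bv;
end event_id_sketch;

package body event_id_sketch is

  -- Assumed inverse of the unmask loops; from_lp is "node # | LP #"
  -- with the node number in the high word.
  function pack_event_id (to_lp, from_lp : dword_bv) return dword_bv is
    variable id : dword_bv := (others => '0');
  begin
    id(4 downto 0)  := from_lp(4 downto 0);     -- FROM_LP #   (20 LPs/node max)
    id(7 downto 5)  := from_lp(18 downto 16);   -- FROM node # (8 nodes max)
    id(12 downto 8) := to_lp(4 downto 0);       -- TO_LP       (20 LPs/node max)
    return id;
  end pack_event_id;

  -- Mirrors the loops that rebuild Reg_32(10) from Reg_32(7).
  function unpack_from_lp (id : dword_bv) return dword_bv is
    variable r : dword_bv := (others => '0');
  begin
    r(4 downto 0)   := id(4 downto 0);          -- FROM_LP #
    r(18 downto 16) := id(7 downto 5);          -- FROM node #
    return r;
  end unpack_from_lp;

  -- Mirrors the loops that rewrite Reg_32(7) in place.
  function unpack_to_lp (id : dword_bv) return dword_bv is
    variable r : dword_bv := (others => '0');
  begin
    r(4 downto 0) := id(12 downto 8);           -- TO_LP
    r(31)         := id(31);                    -- keep the status bit
    return r;
  end unpack_to_lp;

end event_id_sketch;
```

Under these assumptions, unpack_from_lp(pack_event_id(t, f)) returns f whenever f uses only the 5-bit LP and 3-bit node fields, which is the property the FROM_LP comparison in ARCS_Loop relies on.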
-- POST_EVENT procedure (send EVENT &/or NULL messages)
-- Post_Event pseudo code
--   Registers loaded during CPU_IO_PROC
--     REG_32-1 <= TO_LP (will be sender)       -- LP with event
--     REG_32-2 <= MEMR_PTR to event            -- with CPU result
--     REG_32-3 <= OUT_NODE # | OUT_LP #        -- TO_LP for msg
--   update event_time (SIM_TIME + delay)       -- add DWORD
--   read ARCS_OUT reg (send output to)
--   if ARCS_OUT = TO_LP (cycle all out arcs)
--     send EVENT_MSG
--       (from_lp, to_lp, time_tag, event)      -- 4*DWORD
--   else
--     send NULL_MSG
--       (from_lp, to_lp, safe_time)            -- 3*DWORD
procedure Post_Event is                -- send EVENT/NULL msg
  variable NUMARCS_OUT : INTEGER;
begin
  -- load LP_GP_working registers with essential data from RAM Partition
  -- must get pointer to LP_RAM_Partition first

  IOWAITE <= '1';                                -- no state changes
  MAE <= DWORD_to_MADD(Reg_32(1)) after MADEL;   -- part tbl ptr to base addr
  RWE <= '0' after ODEL;                         -- RAM read cycle
  IOE <= '1' after ODEL;
  wait for RDEL;
  BUFF_IOE <= DATAin after FFDEL;                -- LP RAM base_addr
  wait for FFDEL;
  RWE <= 'Z' after ODEL + RDEL;
  IOE <= '0' after ODEL + RDEL;
  MAE <= "ZZZZZZZZZZZ" after ODEL;
  Reg_32E(4) <= BUFF_IOE;                        -- temp for base_addr

  -- don't need ARCS_IN_STATUS -> Next_Addr := base_addr + (1) offset
  Next_Addr := CHANGE(BVtoMVL7V(ItoBV(BVtoI(MVL7VtoBV(CHANGE(BUFF_IOE))) + 1)));
  wait for ALUDEL;                               -- time to increment

loop  to  read  RAM  and  load  LP_GP  registers 

--  Reg_32(5)  <=  LP.DELAY  Reg_32(7)  <= 

—  Reg_32(6)  <=  #  ARCS.IN  I  #  ARCS_0UT 

LP_Simulation_Time 

  LOAD_REG_LOOP:
  for I in 5 to 7 loop                             -- 5 to 7: indx regs
    MAE <= DWORD_to_MADD(Next_Addr) after MADEL;   -- RAM address
    RWE <= '0' after ODEL;                         -- RAM read cycle
    IOE <= '1' after ODEL;
    wait for RDEL;
    BUFF_IOE <= DATAin after FFDEL;                -- LP data
    wait for FFDEL;
    RWE <= 'Z' after ODEL + GDEL;
    IOE <= '0' after ODEL;
    MAE <= "ZZZZZZZZZZZ" after ODEL;
    Reg_32E(I) <= BUFF_IOE;                        -- load LP data
    Next_Addr := CHANGE(BVtoMVL7V(ItoBV(BVtoI(MVL7VtoBV(CHANGE(Next_Addr))) + 1)));
    wait for FFDEL;
  end loop LOAD_REG_LOOP;


  -- Calculate event/null_msg time_tag (LP_Sim_Time + LP_Delay)
  Reg_32E(8) <= CHANGE(BVtoMVL7V(ItoBV(BVtoI(MVL7VtoBV(CHANGE(Reg_32(7)))) +   -- Time Tag
                                       BVtoI(MVL7VtoBV(CHANGE(Reg_32(5)))))));
  wait for ALUDEL;

  -- Output event/null_msg loop with # ARCS_OUT iterations
  --   # ARCS_OUT is LO_WORD of Reg_32(6) -> (15 downto 0)
  -- Calculate base_addr + offset for OUT_LP_ARC entries in RAM
  --   already have base_addr in Reg_32(4)
  --   offset = tot # IN/OUT ARCS + 3 (4 - essential LP data 0:3)
  --   Next_Addr := Reg_32(4) + sum(HI & LO WORDS in Reg_32(6)) + 3
  -- Read LP OUT_ARC from RAM and send EVENT/NULL_msg as required
  Reg_32E(9) <= HI_LO_ADD(Reg_32(6));              -- sum # IN/OUT ARCS
  NUMARCS_OUT := BVtoI(MVL7VtoBV(CHANGE(Reg_32(6)(15 downto 0))));
  wait for ALUDEL;
  ACC <= CHANGE(BVtoMVL7V(ItoBV(BVtoI(MVL7VtoBV(CHANGE(Reg_32(4)))) +          -- LP_Part base addr
                                BVtoI(MVL7VtoBV(CHANGE(Reg_32(9)))) +
                                3)));                                          -- (4) reserved RA
  wait for ALUDEL;
  Next_Addr := ACC;


  -- LOOP through OUT_ARCS in RAM, read & send EVENT or NULL_msg
  IOWAITE <= '1';
  Send_Msg_Loop:
  while NUMARCS_OUT > 0 loop

    -- get next LP_OUT from RAM (TO_LP for "msg") & store (Reg_32(9))
    MAE <= DWORD_to_MADD(Next_Addr) after MADEL;   -- addr for next LP_OUT
    RWE <= '0' after ODEL;                         -- RAM read cycle
    IOE <= '1' after ODEL;                         -- next LP_OUT
    wait for RDEL;
    BUFF_IOE <= DATAin after FFDEL;
    wait for FFDEL;
    RWE <= 'Z' after ODEL + GDEL;
    IOE <= '0' after ODEL;
    MAE <= "ZZZZZZZZZZZ" after ODEL;
    Reg_32E(9) <= BUFF_IOE after FFDEL;            -- temp for TO_LP

    -- decrement Next_Addr in LP_RAM_Partition
    Next_Addr := CHANGE(BVtoMVL7V(ItoBV(BVtoI(MVL7VtoBV(CHANGE(Next_Addr))) - 1)));
    wait for ALUDEL;

    -- TO_LP output for msg
    Intr_CPU_Send(Reg_32(9));                      -- TO_LP
    wait until READYO = '0';                       -- end bus cycle
    RDYE <= '1' after ODEL;

    -- FROM_LP output for msg
    Intr_CPU_Send(Reg_32(1));                      -- FROM_LP (this LP)
    wait until READYO = '0';                       -- end bus cycle
    RDYE <= '1' after ODEL;

    -- TIME_TAG output for msg (safe_lookahead_time for null_msg)
    Intr_CPU_Send(Reg_32(8));                      -- Time_Tag
    wait until READYO = '0';                       -- end bus cycle
    RDYE <= '1' after ODEL;


    -- condition (Is TO_LP the designated LP_OUT for event?)
    --   If YES -> send event else "null" -> loop to next OUT_ARC
    if (Reg_32(3) = Reg_32(9)) then                -- send EVENT_MSG
      Intr_CPU_Send(Reg_32(2));                    -- PTR to EVENT
      wait until READYO = '0';                     -- end bus cycle
      RDYE <= '1' after ODEL;
    else                                           -- send "null"
      Intr_CPU_Send("00000000000000000000000000000000");
      wait until READYO = '0';
      RDYE <= '1' after ODEL;
    end if;

    NUMARCS_OUT := NUMARCS_OUT - 1;
  end loop Send_Msg_Loop;

  BUSYE <= '0' after ODEL;
  IOWAITE <= '0';                                  -- allow state changes
end Post_Event;
-- Decode procedure (Instr from CPU: M_IO='0'; A15='1'; A2='0')
-- Decode and call identified procedure for execution
-- ************************************************************************
procedure DECODE (signal Opcode : MVL7_VECTOR(2 downto 0)) is
begin
  case Opcode is
    when "000" => Init_Sim;      -- INITIALIZE SIMULATION
    when "001" => Post_Msg;      -- POST_MSG (RCV from CPU)
    when "010" => Get_Event;     -- GET_EVENT
    when "011" => Post_Event;    -- POST_EVENT (SEND to_LPs)
    when others => assert FALSE report "Invalid DES opcode"
                   severity FAILURE;
  end case;
end DECODE;

begin
  wait on EXECUTE until EXECUTE;
  DECODE(OPCODE(31 downto 29));                    -- decode & exec DES funcs
  DONEE <= true;                                   -- opcode executed
end process EXECUTE_PROC;

-- ************************************************************************
-- Process Output Multiplexing

IOWAIT <= IOWAITC when not IOWAITC'quiet else      -- WAIT for state change
          IOWAITE when not IOWAITE'quiet else
          IOWAITS when not IOWAITS'quiet else
          IOWAIT;

FLAGS <= FLAGSE when not FLAGSE'quiet else         -- FLAGS register
         FLAGSS when not FLAGSS'quiet else
         FLAGS;

BUSYB <= BUSYS when not BUSYS'quiet else           -- BUSY signal
         BUSYC when not BUSYC'quiet else
         BUSYE when not BUSYE'quiet else
         BUSYB;
BUSY <= BUSYB;

RDY <= RDYC when not RDYC'quiet else               -- end CPU bus cycle
       RDYE when not RDYE'quiet else
       RDY;
READYO <= RDY;

BMA <= MAS when not MAS'quiet else                 -- RAM address
       MAC when not MAC'quiet else
       MAE when not MAE'quiet else
       BMA;
MA <= BMA;

BRW <= RWS when not RWS'quiet else                 -- RAM Read/Write
       RWC when not RWC'quiet else
       RWE when not RWE'quiet else
       BRW;
RW <= BRW;

BIO <= IOS when not IOS'quiet else                 -- RAM I/O
       IOC when not IOC'quiet else
       IOE when not IOE'quiet else
       BIO;
IO <= BIO;

RDCB <= RDCS when not RDCS'quiet else              -- CAM read
        RDCE when not RDCE'quiet else
        RDCB;
RDC <= RDCB;

WTCB <= WTCS when not WTCS'quiet else              -- CAM write
        WTCE when not WTCE'quiet else
        WTCB;
WTC <= WTCB;

NEB <= NES when not NES'quiet else                 -- CAM new event
       NEE when not NEE'quiet else
       NEB;
NE <= NEB;

BUFF_IO <= BUFF_IOC when not BUFF_IOC'quiet else   -- IO buffer register
           BUFF_IOE when not BUFF_IOE'quiet else
           BUFF_IOS when not BUFF_IOS'quiet else
           BUFF_IO;

CPU_IO <= CPU_IOC when not CPU_IOC'quiet else      -- CPU IO state
          CPU_IOS when not CPU_IOS'quiet else
          CPU_IO;

DONE <= DONEC when not DONEC'quiet else            -- DONE flag
        DONEE when not DONEE'quiet else
        DONE;

Reg_32(1) <= Reg_32S(1) when not Reg_32S(1)'quiet else   -- GP registers
             Reg_32C(1) when not Reg_32C(1)'quiet else
             Reg_32E(1) when not Reg_32E(1)'quiet else
             Reg_32(1);

Reg_32(2) <= Reg_32S(2) when not Reg_32S(2)'quiet else
             Reg_32C(2) when not Reg_32C(2)'quiet else
             Reg_32E(2) when not Reg_32E(2)'quiet else
             Reg_32(2);

Reg_32(3) <= Reg_32S(3) when not Reg_32S(3)'quiet else
             Reg_32C(3) when not Reg_32C(3)'quiet else
             Reg_32E(3) when not Reg_32E(3)'quiet else
             Reg_32(3);

Reg_32(4) <= Reg_32S(4) when not Reg_32S(4)'quiet else
             Reg_32C(4) when not Reg_32C(4)'quiet else
             Reg_32E(4) when not Reg_32E(4)'quiet else
             Reg_32(4);

Reg_32(5) <= Reg_32S(5) when not Reg_32S(5)'quiet else
             Reg_32C(5) when not Reg_32C(5)'quiet else
             Reg_32E(5) when not Reg_32E(5)'quiet else
             Reg_32(5);

Reg_32(6) <= Reg_32S(6) when not Reg_32S(6)'quiet else
             Reg_32C(6) when not Reg_32C(6)'quiet else
             Reg_32E(6) when not Reg_32E(6)'quiet else
             Reg_32(6);

Reg_32(7) <= Reg_32S(7) when not Reg_32S(7)'quiet else
             Reg_32C(7) when not Reg_32C(7)'quiet else
             Reg_32E(7) when not Reg_32E(7)'quiet else
             Reg_32(7);

Reg_32(8) <= Reg_32S(8) when not Reg_32S(8)'quiet else
             Reg_32C(8) when not Reg_32C(8)'quiet else
             Reg_32E(8) when not Reg_32E(8)'quiet else
             Reg_32(8);

Reg_32(9) <= Reg_32S(9) when not Reg_32S(9)'quiet else
             Reg_32C(9) when not Reg_32C(9)'quiet else
             Reg_32E(9) when not Reg_32E(9)'quiet else
             Reg_32(9);

Reg_32(10) <= Reg_32S(10) when not Reg_32S(10)'quiet else
              Reg_32C(10) when not Reg_32C(10)'quiet else
              Reg_32E(10) when not Reg_32E(10)'quiet else
              Reg_32(10);


end  BEHAVIOR; 


B.3  Parallel  I/O  Behavior 

This appendix provides the source listing of the VHDL architectural behavior of
the parallel I/O ports used in the DES coprocessor system. The behavior is taken from
Armstrong's chip-level model of the Mark II processor (2:120-123). The parallel I/O
component instantiation of the DES coprocessor does not use the interrupt line provided with
this behavior.
-- FILE: par_port_beh.vhd
-- AUTHOR: Paul J. Taylor
-- PURPOSE: Architectural BEHAVIOR of the PARALLEL PORT
-- REFERENCE: Chip-Level Modeling with VHDL
--            (James Armstrong pp. 120-123)

library ZYCAD;
library DESIGN;
use ZYCAD.TYPES.all;
use ZYCAD.BV_ARITHMETIC.all;
use WORK.all;
use WORK.SYSTEM.all;

entity PAR is
  generic(GDEL, FFDEL, BUFDEL: TIME);
  port(DI: in DWORD;
       DO: out DWORD := "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ";
       NDS1, DS2, MD, NCLR: in MVL7;
       STB: in MVL7;
       NINT: out MVL7);
end PAR;


architecture BEHAVIOR of PAR is
  signal S0, S1, S2, S3: MVL7;
  signal SRQ: MVL7;
  signal Q, Q1, Q2: DWORD := "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ";
begin

  S: block (S1 = '1' and NCLR = '1')
  begin
    Q1 <= guarded DI after FFDEL;
    Q2 <= "00000000000000000000000000000000" after FFDEL when (NCLR = '0')
          else Q2;
    DO <= Q after BUFDEL when (S3 = '1')
          else "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ" after BUFDEL;
  end block;

  S0 <= not NDS1 and DS2 after GDEL;
  S1 <= (S0 and MD) or (STB and not MD) after (2*GDEL);
  S2 <= S0 or not NCLR after GDEL;
  S3 <= S0 or MD after GDEL;

  SERVRQ: process (S2, STB)
  begin
    if (S2 = '0') then
      SRQ <= '1' after FFDEL;
    elsif (S2 = '1') and (not STB'STABLE) and (STB = '0') then
      SRQ <= '0' after FFDEL;
    else
      SRQ <= SRQ;
    end if;
  end process SERVRQ;

  NINT <= not SRQ nor S0 after GDEL;
  Q <= Q1 when not Q1'QUIET else
       Q2 when not Q2'QUIET else
       Q;

end BEHAVIOR;
B.4 RAM Memory Behavior

The RAM memory behavior is shown in this appendix. The basic operation follows
that of an example RAM memory included with the Zycad VHDL system (32:10-51, 10-53).
The behavior includes procedures for both read and write operations and is initialized
with the RAM partition pointer table via a file read operation provided by the standard
textio package.
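
The read and write rules of this behavior can also be illustrated outside VHDL. The following is a minimal Python sketch, not the thesis model: the class name `RamModel`, the encoding of an address as a string of '0', '1', 'X' and 'Z' characters, and the use of `None` to stand for a high-impedance output are all assumptions made for illustration. It mirrors only three points of the listing: the memory is preloaded from lines of binary text, a read with an undefined address bit drives no data, and a write with an undefined address bit is rejected.

```python
from typing import List, Optional

WORD_BITS = 32  # width of one DWORD in the listing


class RamModel:
    """Illustrative stand-in for the RAM behavior (hypothetical names)."""

    def __init__(self, addr_bits: int, init_lines: List[str]) -> None:
        self.addr_bits = addr_bits
        # None marks a word that was never written (assumption of this sketch).
        self.mem: List[Optional[int]] = [None] * (2 ** addr_bits)
        # Preload consecutive words from binary text lines, as the file read does.
        for index, line in enumerate(init_lines):
            self.mem[index] = int(line.strip(), 2)

    def _decode(self, addr: str) -> Optional[int]:
        """Return the integer address, or None if any bit is not '0' or '1'."""
        if len(addr) != self.addr_bits or any(bit not in "01" for bit in addr):
            return None
        return int(addr, 2)

    def read(self, addr: str) -> Optional[int]:
        """Return the stored word, or None (high impedance) for a bad address."""
        index = self._decode(addr)
        return None if index is None else self.mem[index]

    def write(self, addr: str, data: int) -> bool:
        """Store one word; return False (nothing stored) for a bad address."""
        index = self._decode(addr)
        if index is None:
            return False
        self.mem[index] = data & (2 ** WORD_BITS - 1)
        return True


if __name__ == "__main__":
    ram = RamModel(11, ["0" * 31 + "1", "0" * 30 + "10"])
    print(ram.read("00000000001"))      # word loaded from the second init line
    print(ram.read("0000000000X"))      # undefined address bit: no data driven
    print(ram.write("00000000010", 7))  # valid write
```

The 11-bit address width in the usage lines follows the 11-character address strings driven by the coprocessor listing; any other width works the same way.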
-- FILE: ram_mem_beh.vhd
-- AUTHOR: JT, PH, GWH
-- PURPOSE: Architectural BEHAVIOR of the RAM_MEM
-- DATE: 27 Aug 91
-- SOURCE: ZYCAD User's Manual pp. 10-51 -- 10-53
-- HISTORY: None

library ZYCAD;
library DESIGN;
use ZYCAD.TYPES.all;
use ZYCAD.BV_ARITHMETIC.all;
use WORK.all;
use WORK.SYSTEM.all;
use WORK.BUS_SYS.all;
use STD.TEXTIO.all;

entity RAM_MEM is
  generic(Ndata: Positive;                             -- # of data lines
          Naddr: Positive;                             -- # of addr lines
          RDEL, DISDEL: TIME);                         -- delay times
  port(DATAI: in DWORD;                                -- data in lines
       DATAO: out DWORD :=                             -- data out lines
              "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ";
       ADDR: in MVL7_VECTOR(Naddr-1 downto 0);         -- address lines
       CE: in MVL7;                                    -- chip enable (high)
       RW: in MVL7);                                   -- read (low) and write (high)
end RAM_MEM;

architecture BEHAVIOR of RAM_MEM is
begin

  -- assertion for changes in address lines
  Assertion: process
  begin
    assert not (RW = '1' and CE = '1' and ADDR'EVENT)
      report "Address lines changed while RAM is Write Enabled"
      severity WARNING;
    wait on CE, RW, ADDR;
  end process Assertion;


—  the  memory  model 


  P: process
    subtype ETYPE is DWORD;
    type MEMTYPE is array (Natural range <>) of ETYPE;
    variable m: MEMTYPE(0 to 2**ADDR'length-1);
    variable TEMP : BIT_VECTOR(31 downto 0);     -- for init only
    file RAMDEF : TEXT is in "ram_text";         -- memory define file
    variable STARTUP : BOOLEAN := true;          -- memory need init?
    variable L : LINE;                           -- current input line from file
    variable J : INTEGER;                        -- RAM index

    procedure do_read is
    begin
      for i in ADDR'RANGE loop
        if ADDR(i) = 'X' or ADDR(i) = 'Z' then
          DATAO <= (others => 'Z');
          return;
        end if;
      end loop;
      DATAO <= m(BVtoI(MVL7VtoBV(ADDR)));
    end do_read;

    procedure do_write(data: ETYPE) is
    begin
      for i in ADDR'RANGE loop
        if ADDR(i) = 'X' or ADDR(i) = 'Z' then
          assert false
            report "Attempted write to bad RAM address"
            severity WARNING;
          DATAO <= (others => 'Z');
          return;
        end if;
      end loop;
      m(BVtoI(MVL7VtoBV(ADDR))) := data;
      DATAO <= data;
    end do_write;


  begin
    -- The variable m and the port DATAO are initialized correctly
    -- at elaboration time. This makes it best to wait at the top
    -- of the process.

    if STARTUP then                              -- Initialize the RAM only once
      for J in 0 to 19 loop                      -- Loading RAM
        readline(RAMDEF, L);
        read(L, TEMP);
        m(J) := CHANGE(BVtoMVL7V(TEMP));
      end loop;
      STARTUP := false;
    end if;

    wait on CE, RW, DATAI, ADDR;

    if (CE = '1') then
      if DATAI'EVENT and RW = '1' then
        do_write(DATAI);
      elsif ADDR'EVENT and RW /= 'X' and RW /= 'Z' then
        if RW = '1' then
          do_write(DATAI);
        else
          do_read;
        end if;
      elsif RW'EVENT then
        if RW = '1' then
          do_write(DATAI);
        elsif RW = '0' then
          do_read;
        else
          do_write((DATAI'RANGE => 'X'));
        end if;
      end if;
    else
      DATAO <= (others => 'Z');
    end if;

  end process P;
end BEHAVIOR;
B.5 CAM Memory Behavior

The source code listing for the behavior of the CAM memory, described in Section
4.3.4.3, is included in this appendix. A single CAM process is regulated by control signals
from the DES coprocessor. The CAM's operation is sensitive to both read and write control
lines, which are constantly monitored for changes.

The CAM maintains an array of events, each 78 bits in width. It writes DES
coprocessor inputs into the first available cell and provides the next event when read by the
DES coprocessor. Reads require both an identical match (i.e., designated TO_LP) and a
"less than" comparison (i.e., earliest time) of valid entries to retrieve the next event for
CPU execution. Additionally, the CAM provides a free-space status to the DES coprocessor
after each write and a multiple message indication, if the next event was provided by an
LP that has additional messages waiting for execution in the CAM.
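
The write and read rules described above can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not the thesis implementation: the names `Event` and `CamModel` are hypothetical, a row whose valid bit is clear is represented by `None`, and the row-by-row search stands in for the comparison that the hardware CAM would perform in parallel. A write stores the event in the first free row and reports whether the CAM is then full; a read returns the earliest valid event addressed to the requested LP, frees its row, and reports whether another waiting event carries the same TO_LP, FROM_NODE and FROM_LP fields.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Event:
    """One CAM row without its valid bit (field names follow the row layout)."""
    to_lp: int
    from_node: int
    from_lp: int
    time_tag: int
    memr_ptr: int


class CamModel:
    """Illustrative stand-in for the CAM behavior (hypothetical names)."""

    def __init__(self, rows: int) -> None:
        # None means the row's valid bit is clear (assumption of this sketch).
        self.rows: List[Optional[Event]] = [None] * rows

    def write(self, event: Event) -> bool:
        """Store the event in the first free row; return True if now full."""
        free = self.rows.index(None)  # raises ValueError when no row is free
        self.rows[free] = event
        return None not in self.rows

    def read(self, to_lp: int) -> Optional[Tuple[Event, bool]]:
        """Return (next event, multiple flag) for to_lp, or None if no match."""
        matches = [i for i, row in enumerate(self.rows)
                   if row is not None and row.to_lp == to_lp]
        if not matches:
            return None
        # "Less than" comparison: keep the match with the smallest time tag.
        addr = min(matches, key=lambda i: self.rows[i].time_tag)
        event = self.rows[addr]
        self.rows[addr] = None  # clear the valid bit of the row being sent
        key = (event.to_lp, event.from_node, event.from_lp)
        multiple = any(row is not None and
                       (row.to_lp, row.from_node, row.from_lp) == key
                       for row in self.rows)
        return event, multiple


if __name__ == "__main__":
    cam = CamModel(4)
    cam.write(Event(1, 0, 2, 50, 100))
    cam.write(Event(1, 0, 2, 30, 101))
    print(cam.read(1))  # earliest event for LP 1, with the multiple flag set
```

Representing free rows as `None` keeps the full check to a single membership test, which plays the role of the logical AND over all valid bits.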
-- FILE: CAM_BEH.vhd
-- AUTHOR: Paul J. Taylor
-- PURPOSE: Architectural BEHAVIOR of the DES CAM
--
-- Overview of CAM operation:
--   Must store all events for all LPs (20 max per node)
--   Each LP is limited to 10 Input_Arcs max
--   Using "median" values: (10 LP/node)(5 input/LP) = 50 inputs/node
--   Assuming an average of 10 events/LP pending => CAM capacity >= 500 events
--
-- Constraints:
--   Event fields/CAM row (or row) ~ 80 bits = 8 bytes
--   | valid bit | TO_LP   | FROM_NODE | FROM_LP | TIME_TAG | MEMR_PTR |
--   | 1 bit     | 5 bits  | 3 bits    | 5 bits  | 32 bits  | 32 bits  |
--   | 77        | 76   72 | 71     69 | 68   64 | 63    32 | 31     0 |
--
--   1024 bytes => 128 events     2048 bytes => 256 events
--   4096 bytes => 512 events     8192 bytes => 1024 events
--   512 addresses => 10 address bits (to store 512 events)
--
-- Considerations:
--   VALID BIT:    set "high = '1'" as event is read/stored by CAM
--                 Toggle "low = '0'" when event is written/sent by CAM
--                 Logical "AND" of all valid bits will determine when CAM is full
--   MULTIPLE BIT: set "high = '1'" if, during search for Next_Event, multiple
--                 matches of both TO_LP and FROM_LP fields occur.
--                 Bit 31 of first DWORD sent to DES coprocessor during Get_Event
--                 Doesn't require extra field in CAM row; rather, it's a fill
--                 bit for the first DWORD sent to DES coprocessor.
--   Determine CAM full status after storing new event
--   DES coprocessor will read CAM status after event store (to update DES flags)
--   Might consider adding "control" line to distinguish "status" from "event" read

library ZYCAD;
library DESIGN;
use ZYCAD.TYPES.all;
use ZYCAD.BV_ARITHMETIC.all;
use WORK.all;
use WORK.SYSTEM.all;
use WORK.BUS_SYS.all;

-- CAM ENTITY

entity C_MEM is
  generic(RDEL, WDEL, DISDEL: TIME);
  port(DATAinto  : in DWORD;
       DATAoutof : out DWORD := "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ";
       CLK       : in BIT;
       DS1, NDS2, MODE, N_CLR : in MVL7);
end C_MEM;

-- CAM Pseudo/Algorithm
--
-- Write to CAM: (DES Post_Msg)
--   Read DATA_BUS for (3) successive DWORDS
--   Traverse array of CAM rows till first "invalid" (i.e. bit 77 = '0')
--     set bit 77 = '1' (i.e. valid)
--     save this location (address) for successive writes/stores
--   Mask least significant 13 bits of 1st DWORD in, and store in 76 downto 64
--   Store 2nd DWORD in 63 downto 32
--   Store 3rd DWORD in 31 downto 0
--   Check CAM_FULL status
--     logical "AND" of all CAM row "valid" bits ('1' => FULL)
--     for loop (size of CAM) exit on any "not FULL"
--   Send CAM_STATUS to DES coprocessor; write immediately after read
--
-- Read from CAM: (DES Get_Event)
--   Read DATA_BUS and buffer "who" (TO_LP) requires Next_Event
--   Traverse array of CAM rows
--     search for match on "TO_LP" field ("valid" bit 77 = '1')
--   Traverse array of matches for earliest TIME_TAG (Next_Event): NOTE ADDRESS
--   NOTE: additional pass searching for FROM_LP matches with Next_Event
--     if multiple matches -> set MULTIPLE bit "high = '1'"
--   Save Next_Event Address (send this row) -> Toggle "valid" bit to '0'
--   Send Next_Event (3*DWORD) to DES coprocessor
--     construct 1st DWORD: bit31 = MULTIPLE bit; bits 12 downto 0 = TO/FROM LP
--               2nd DWORD: TIME_TAG; 3rd DWORD: MEMR_PTR


-- CAM ARCHITECTURE

architecture BEHAVIOR of C_MEM is
begin

  CAM_PROC : process
    subtype ADDR is MVL7_VECTOR(9 downto 0);     -- 512 event addr
    subtype ROW is MVL7_VECTOR(77 downto 0);     -- event store
    type MEMTYPE is array (Natural range <>) of ROW;
    variable EVENT: MEMTYPE(0 to 2**ADDR'length-1) :=
        MEMTYPE'(0 to 2**ADDR'length-1 => (others => 'Z'));   -- every row all 'Z'
    variable Next_Event    : ROW;                     -- send to DES
    variable FULL          : BOOLEAN := false;
    variable SEND_STAT     : BOOLEAN := false;
    variable EVENT_ID_REG  : DWORD;                   -- hold TO_LP
    variable HAVE_ADDR     : BOOLEAN := false;        -- new event location
    variable ROW_LOC       : INTEGER := 0;            -- event addr index
    variable EARLIEST_TIME : INTEGER := 2147483647;   -- 7FFFFFFF (max time)
    variable EARLIEST_ADDR : INTEGER;
    variable EVENT_SEG     : INTEGER := 1;            -- event segment index
    variable HAVE_EVENT    : BOOLEAN := false;        -- have next event
    variable MULTIPLE      : BOOLEAN := false;        -- have multiple?

  begin
    wait on DS1, NDS2, MODE, N_CLR;

    -- Clear CAM by resetting bit 77 in all events
    if N_CLR = '0' then                              -- clear CAM
      for I in 0 to 2**ADDR'length-1 loop
        EVENT(I)(77) := '0';                         -- reset "valid" bits
      end loop;
    end if;

    -- Write EVENT into CAM
    if (NDS2 = '1' and MODE = '0') then              -- WTC = '1' and NE = '0'
      if not HAVE_ADDR then
        FREE_SPACE_LOOP:
        for I in 1 to 2**ADDR'length-1 loop          -- traverse all CAM rows
          ROW_LOC := ROW_LOC + I;
          if EVENT(ROW_LOC)(77) = '0' then           -- "FREE" event space
            EVENT(ROW_LOC)(77) := '1';               -- use this address
            HAVE_ADDR := true;                       -- HAVE_ADDR for event
            exit;
          end if;
        end loop FREE_SPACE_LOOP;
      end if;

      -- Wait for event address then read data_bus for 3*DWORD event and store in CAM
      if HAVE_ADDR then
        case EVENT_SEG is

          -- Store first event field (event_id) and toggle "HAVE_ADDR"
          when 1 => EVENT(ROW_LOC)(76 downto 64) := CHANGE(DATAinto(12 downto 0));
                    EVENT_SEG := EVENT_SEG + 1;

          -- Store second event field (time_tag)
          when 2 => EVENT(ROW_LOC)(63 downto 32) := CHANGE(DATAinto);
                    EVENT_SEG := EVENT_SEG + 1;

          -- Store third event field (memr_ptr)
          when 3 => EVENT(ROW_LOC)(31 downto 0) := CHANGE(DATAinto);
                    EVENT_SEG := 1;
                    SEND_STAT := true;               -- report status to coprocessor
                    HAVE_ADDR := false;

                    -- EVENT is stored, must check CAM_FULL status
                    CHECK_CAM_FULL_LOOP:                        -- done in parallel
                    for I in ROW_LOC to 2**ADDR'length-1 loop   -- full to ROW_LOC
                      if EVENT(I)(77) = '0' then                -- row not in use
                        FULL := false;
                        exit;
                      elsif EVENT(I)(77) = '1' then             -- row in use
                        FULL := true;                           -- keep checking
                      end if;
                    end loop CHECK_CAM_FULL_LOOP;

          when others => assert FALSE report "Invalid CAM Write"
                         severity FAILURE;
        end case;
      end if;   -- if HAVE_ADDR
    end if;     -- if (NDS2 = '1' and MODE = '0')


    -- Report CAM_FULL status out to DES coprocessor
    if SEND_STAT then
      wait until DS1 = '1';                          -- RDC = '1'
      if FULL then                                   -- send FULL stat msg
        DATAoutof <= "00000000000000000000000000000001" after FFDEL,
                     "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ" after RDEL;
      else                                           -- send NOT FULL stat msg
        DATAoutof <= "00000000000000000000000000000000" after FFDEL,
                     "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ" after RDEL;
      end if;
      SEND_STAT := false;
    end if;


    -- READ EVENT from CAM
    -- wait on MODE until MODE = '1';                -- read event coming
    if (NDS2 = '1' and MODE = '1') then              -- WTC = '1' and NE = '1'
      EVENT_ID_REG := DATAinto;                      -- TO_LP for Next_Event

      -- traverse array of CAM rows (actually done in parallel!)
      -- Match Conditions:
      --   must be valid event   - bit (77) = '1'
      --   must be for TO_LP     - bits (76 downto 72) match TO_LP's (count matches)
      --   must be earliest time - bits (63 downto 32) are smallest
      NEXT_EVENT_LOOP:
      for I in 0 to 2**ADDR'length-1 loop
        if (EVENT(I)(77) = '1' and (EVENT(I)(76 downto 72) =
                                    CHANGE(EVENT_ID_REG(4 downto 0)))) then
          if EARLIEST_TIME > BVtoI(MVL7VtoBV(EVENT(I)(63 downto 32))) then
            EARLIEST_TIME := BVtoI(MVL7VtoBV(EVENT(I)(63 downto 32)));
            EARLIEST_ADDR := I;
          end if;
          Next_Event := EVENT(EARLIEST_ADDR);
        end if;
        if I = 2**ADDR'length-1 then
          HAVE_EVENT := true;
          EVENT(EARLIEST_ADDR)(77) := '0';           -- "used"
          Next_Event(77) := '0';                     -- assume "NO" multiples
        end if;
      end loop NEXT_EVENT_LOOP;
      wait for FFDEL;                                -- toggle bit(77) above

      -- Search matches from above looking for multiple occurrences of same FROM_LP
      -- for updating ARCS_IN_STATUS
      if (MODE = '1' and HAVE_EVENT) then
        MULTIPLE_EVENT_LOOP:
        for I in 0 to 2**ADDR'length-1 loop
          if (EVENT(I)(77) = '1' and (EVENT(I)(76 downto 64) =
                                      Next_Event(76 downto 64))) then
            -- Multiple FROM_LP events
            DATAoutof(31 downto 13) <= "1000000000000000000";   -- toggle multiple event
          else
            DATAoutof(31 downto 13) <= "0000000000000000000";
          end if;                                    -- no multiple events
        end loop MULTIPLE_EVENT_LOOP;
        wait for FFDEL;                              -- toggle MSB bit above
      end if;   -- if (MODE = '1' and HAVE_EVENT)
    end if;     -- if (WTC = '1' and NE = '1')

    -- Send Next_Event to DES (to include MULTIPLE status)
    if (DS1 = '1' and HAVE_EVENT) then               -- RDC = '1'

      -- MUST LOOP FOR 3*DWORD OUTPUT
      case EVENT_SEG is

        -- send 1st part of Next_Event (event_id)
        when 1 => DATAoutof(12 downto 0) <= CHANGE(Next_Event(76 downto 64));
                  wait for FFDEL;

        -- send 2nd part of Next_Event (time_tag)
        when 2 => DATAoutof <= CHANGE(Next_Event(63 downto 32));
                  wait for FFDEL;

        -- send 3rd part of Next_Event (memory pointer)
        when 3 => DATAoutof <= CHANGE(Next_Event(31 downto 0));
                  wait for FFDEL;
                  EVENT_SEG := 0;                    -- reset for next
                  HAVE_EVENT := false;

        when others => assert FALSE report "Invalid CAM Read"
                       severity FAILURE;
      end case;

      EVENT_SEG := EVENT_SEG + 1;
      DATAoutof <= "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ" after RDEL;
    end if;

  end process CAM_PROC;
end BEHAVIOR;
Appendix C. DES Coprocessor System Test

This appendix contains the DES coprocessor system configuration, the system testbench
used to verify the DES coprocessor operation, and the CPU driver for testbench
stimulation.

C.1 DES System Configuration

The DES coprocessor system configuration is given in this appendix. The configuration
encapsulates the entity behaviors required by the testbench. The source code listings
for the entity behaviors are included in Appendix B.


library ZYCAD;
use ZYCAD.TYPES.all;
use WORK.all;
use WORK.SYSTEM.all;
use WORK.BUS_SYS.all;

configuration Des_system of Des_sys_test_bench is
  for test
    for CLOCK_CKT: Sys_clk
      use entity WORK.CLOCK_CKT(BEHAVIOR);
    end for;
    for CPU: CPU_driver
      use entity WORK.CPU_driver(BEHAVIOR);
    end for;
    for COPROC: DES_sys
      use entity WORK.DES_sys(CHIP_LEVEL);
    end for;
  end for;
end Des_system;
C.2 DES System Test Bench

The testbench entity for the DES coprocessor system is contained in this appendix.
The three components making up the testbench (i.e., DES system, CPU driver, and Clock)
are declared in the architectural body. The signal mapping between components is included
to provide the testbench interconnections shown in Figure 5.1. A stopping process is also
included to prevent simulation runaway. Simulation runtimes can be varied by adjusting
the stop_sim time accordingly.
-- FILE: Des_sys_test_bench.vhd
-- AUTHOR: Paul J. Taylor
-- PURPOSE: The test_bench for the DES coprocessor system

library ZYCAD;
library DESIGN;
use ZYCAD.TYPES.all;
use WORK.all;
use WORK.SYSTEM.all;
use WORK.BUS_SYS.all;

-- THE ENTITY DECLARATION:

entity des_sys_test_bench is
end des_sys_test_bench;


—  THE  ARCHITECTURAL  BODY: 


architecture test of des_sys_test_bench is

  component DES_sys
    port(RUN     : in MVL7;
         CLK     : in BIT;
         RESETIN : in MVL7;
         WR      : in MVL7;
         NPS1    : in MVL7;
         NPS2    : in MVL7;
         CMDO    : in MVL7;
         INTR    : out MVL7;
         READYO  : inout MVL7;
         BUSY    : out MVL7;
         ERROR   : out MVL7;
         ADD_STR : in MVL7;
         SYSIN   : in DWORD;
         SYSOUT  : out DWORD);
  end component;

  component CPU_driver
    port(RUN      : in MVL7;
         CLOCK    : in BIT;
         RESETout : out MVL7;
         WRout    : out MVL7;
         M_IOout  : out MVL7;
         A15out   : out MVL7;
         A2out    : out MVL7;
         INTin    : in MVL7;
         RDYin    : in MVL7;
         BUSYin   : in MVL7;
         ERRin    : in MVL7;
         ASTRout  : out MVL7;
         DATAout  : out DWORD;
         DATAin   : in DWORD);
  end component;

  component Sys_clk
    generic(PER: TIME := 125 ns);
    port(CLK2: inout BIT; RUN: in MVL7);
  end component;


  signal STOP_SIM          : BOOLEAN := false;
  signal RUN               : MVL7;
  signal CLK, CLOCK, CLK2  : BIT;
  signal RESETIN, RESETout : MVL7;
  signal WR, WRout         : MVL7;
  signal NPS1, M_IOout     : MVL7;
  signal NPS2, A15out      : MVL7;
  signal CMDO, A2out       : MVL7;
  signal INTR, INTin       : MVL7;
  signal READYO, RDYin     : MVL7;
  signal BUSY, BUSYin      : MVL7;
  signal ERROR, ERRin      : MVL7;
  signal ADD_STR, ASTRout  : MVL7;
  signal SYSIN, DATAout    : DWORD;
  signal SYSOUT, DATAin    : DWORD;


begin

  CLOCK_CKT: Sys_clk port map (CLK2, RUN);

  CPU: CPU_driver port map (RUN, CLOCK, RESETout, WRout, M_IOout, A15out, A2out,
                            INTin, RDYin, BUSYin, ERRin, ASTRout, DATAout, DATAin);

  COPROC: DES_sys port map (RUN, CLK, RESETIN, WR, NPS1, NPS2, CMDO, INTR,
                            READYO, BUSY, ERROR, ADD_STR, SYSIN, SYSOUT);

  -- CPU_driver to DES_sys signal correspondence
  CLK     <= CLK2;
  CLOCK   <= CLK2;
  RESETIN <= RESETout;
  WR      <= WRout;
  NPS1    <= M_IOout;
  NPS2    <= A15out;
  CMDO    <= A2out;
  INTin   <= INTR;
  RDYin   <= READYO;
  BUSYin  <= BUSY;
  ERRin   <= ERROR;
  ADD_STR <= ASTRout;
  SYSIN   <= DATAout;
  DATAin  <= SYSOUT;

  RUN_TEST: process
  begin
    RUN <= '1';
    STOP_SIM <= true after 30_000 ns;
    wait for 100_000 ns;
  end process RUN_TEST;

  -- STOP_CONTROL process
  -- Purpose: Terminates the simulation
  STOP_CONTROL: process
  begin
    wait until STOP_SIM = true;
    assert false report "Simulation Done" severity failure;
  end process STOP_CONTROL;

end Test;
C.3 CPU Driver Behavior

The testbench stimulus is provided by the CPU driver contained in this appendix.
The following source code activates a DES coprocessor test with one logical process per
computing node. Multiple LP configurations are implemented with an extension to this
driver.

The CPU control signals and system bus are activated by the test process from the
CPU driver entity. Procedures are included for frequently used operations such as loading
opcode instructions and operands for the DES coprocessor system. An operate procedure is
also included to simulate the CPU's operation when not directly driving the DES coprocessor.

The CPU driver process is sensitive to the DES coprocessor's READYO status line,
and bus cycles are initiated on both negative and positive system clock transitions.
-- FILE: CPU_driver.vhd
-- AUTHOR: Paul J. Taylor
-- PURPOSE: CPU driver for DES coprocessor test_bench

library ZYCAD;
library DESIGN;
use ZYCAD.TYPES.all;
use WORK.all;
use WORK.SYSTEM.all;
use WORK.BUS_SYS.all;


-- THE ENTITY DECLARATION:

entity CPU_driver is
    port (RUN      : in  MVL7;
          CLOCK    : in  BIT;
          RESETout : out MVL7;
          WRout    : out MVL7;
          M_IOout  : out MVL7;
          A15out   : out MVL7;
          A2out    : out MVL7;
          INTin    : in  MVL7;
          RDYin    : in  MVL7;
          BUSYin   : in  MVL7;
          ERRin    : in  MVL7;
          ASTRout  : out MVL7;
          DATAout  : out DWORD;
          DATAin   : in  DWORD);
end CPU_driver;


-- THE ARCHITECTURAL BODY:

architecture BEHAVIOR of CPU_driver is
begin

    -- ONE_LP process
    -- PURPOSE: Exercise the DES coprocessor. Run the fundamental
    --          procedures (Initialize, Post_Message, Get_Event,
    --          and Post_Event) on a Discrete Event Simulation with
    --          one (1) LP on the CPU node (i.e. car wash config #1).

    ONE_LP: process
        variable SET_UP          : BOOLEAN := false;
        variable LOADED          : BOOLEAN := false;
        variable DO_INIT_SIM     : BOOLEAN := false;
        variable LOAD_POST_MSG   : BOOLEAN := false;
        variable DO_POST_MSG     : BOOLEAN := false;
        variable LOAD_GET_EVENT  : BOOLEAN := false;
        variable DO_GET_EVENT    : BOOLEAN := false;
        variable LOAD_POST_EVENT : BOOLEAN := false;
        variable DO_POST_EVENT   : BOOLEAN := false;

        -- LOAD_INSTR procedure
        -- PURPOSE: Send OPCODE to DES Coprocessor
        procedure LOAD_INSTR (INPUT : DWORD) is
            variable OPCODE : DWORD;
        begin
            OPCODE := INPUT;
            M_IOout <= 'Z', '0' after 5 ns;
            WRout   <= 'Z', '1' after 5 ns;
            A15out  <= 'Z', '1' after 5 ns;
            A2out   <= 'Z', '0' after 5 ns;
            DATAout <= "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ",
                       OPCODE after 5 ns;
            ASTRout <= '0', '1' after PER/2;
        end LOAD_INSTR;

        -- LOAD_DATA procedure
        -- PURPOSE: Send OPERAND to DES Coprocessor
        procedure LOAD_DATA (INPUT : DWORD) is
            variable OPERAND : DWORD;
        begin
            OPERAND := INPUT;
            M_IOout <= 'Z', '0' after 5 ns;
            WRout   <= 'Z', '1' after 5 ns;
            A15out  <= 'Z', '1' after 5 ns;
            A2out   <= 'Z', '1' after 5 ns;
            DATAout <= "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ",
                       OPERAND after 5 ns;
            ASTRout <= '0', '1' after PER/2;
            wait for PER/2;
        end LOAD_DATA;

        -- ************************************************************
        -- INT_SERV procedure
        -- PURPOSE: Service interrupt requests from Coprocessor.
        --          Acknowledges by changing status of control lines
        --          to address the coprocessor (A15) in I/O space (M_IO)
        --          and setting mode to read (WRout).
        --          After bus cycle (RDYin = '0'), disconnect COP by
        --          resetting control lines.
        -- ************************************************************
        procedure INT_SERV is
        begin
            -- COP interrupts CPU
            assert (RDYin = '1') report "not RDYin during INT Request"
                severity WARNING;

            if (RDYin = '1') then              -- and (RDYin = '1')) then
                M_IOout <= '0' after ODEL;     -- I/O address space
                A15out  <= '1' after ODEL;     -- cop address
                WRout   <= '0' after ODEL;     -- cpu read cycle
                wait for ODEL;
                wait until (RDYin = '0');      -- ends cop write cycle
            end if;
        end INT_SERV;

        -- ************************************************************
        -- OPERATE procedure
        -- PURPOSE: Perform CPU operations. Wait until interrupt
        --          received from DES Coprocessor, then service it.
        -- ************************************************************
        procedure OPERATE is
        begin
            WORK_Loop:
            while ((RUN = '1') and (BUSYin = '1')) loop          -- CPU "on" with RUN
                DATAout <= "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ";   -- Release bus
                M_IOout <= 'Z' after ODEL;                       -- Release Coprocessor
                WRout   <= 'Z' after ODEL;                       --    "        "
                A15out  <= '0' after ODEL;                       --    "        "
                A2out   <= '0' after ODEL;
                wait for ODEL;
                if (INTin = '1') then                            -- service int rqst
                    INT_SERV;
                end if;
            end loop WORK_Loop;
        end OPERATE;

    -- BEGIN ONE_LP PROCESS
    begin

        -- Load OPCODE and OPERANDS for INIT_SIM procedure
        if not LOADED then                         -- send instr & data
            if not SET_UP then                     -- init coprocessor
                RESETout <= '1', '0' after PER/2;
                SET_UP := false;
            end if;


            LOAD_INSTR("00000000000000000000000000000000");       -- Init_Sim opcode
            wait until ((RDYin = '0') and (not CLOCK'stable));    -- DES ends bus cycle
            LOAD_DATA("00000000000000000000000000000000");        -- TO LP 0|0
            wait until ((RDYin = '0') and (not CLOCK'stable));    -- DES ends bus cycle
            LOAD_DATA("00000000000000000000000000000100");        -- LP_DELAY 4 units
            wait until ((RDYin = '0') and (not CLOCK'stable));    -- DES ends bus cycle
            LOAD_DATA("00000000000000000000000000000010");        -- # ARCS_IN
            wait until ((RDYin = '0') and (not CLOCK'stable));    -- DES ends bus cycle
            LOAD_DATA("00000000000000000000000000000011");        -- # ARCS_OUT
            wait until ((RDYin = '0') and (not CLOCK'stable));    -- DES ends bus cycle
            LOAD_DATA("00000000000000000000000000000000");        -- IN_NODE 0 | IN_LP 0
            wait until ((RDYin = '0') and (not CLOCK'stable));    -- DES ends bus cycle
            LOAD_DATA("00000000000001110000000000000000");        -- IN_NODE 7 | IN_LP 0
            wait until ((RDYin = '0') and (not CLOCK'stable));    -- DES ends bus cycle
            LOAD_DATA("00000000000000000000000000000000");        -- OUT_NODE 0 | OUT_LP 0
            wait until ((RDYin = '0') and (not CLOCK'stable));    -- DES ends bus cycle
            LOAD_DATA("00000000000000110000000000000000");        -- OUT_NODE 3 | OUT_LP 0
            wait until ((RDYin = '0') and (not CLOCK'stable));    -- DES ends bus cycle
            LOAD_DATA("00000000000001000000000000000000");        -- OUT_NODE 4 | OUT_LP 0
            wait until (RDYin = '0');                             -- DES ends bus cycle

            LOADED      := true;
            DO_INIT_SIM := true;

            -- Test Init_Sim function
            if DO_INIT_SIM then
                OPERATE;
                if (BUSYin = '0') then
                    DO_INIT_SIM   := false;
                    LOAD_POST_MSG := true;
                    wait until (not CLOCK'stable);                -- next test synch
                end if;
            end if;

            -- Load opcode and operands to run Post_Msg
            -- ************************************************************
            if (LOAD_POST_MSG and (BUSYin = '0')) then            -- load POST_MSG test
                wait until (not CLOCK'stable);                    -- start on clock
                LOAD_INSTR("00100000000000000000000000000000");   -- Post_Msg opcode
                wait until ((RDYin = '0') and (not CLOCK'stable)); -- DES ends bus cycle
                LOAD_DATA("00000000000000000000000000000000");    -- TO_LP (NODE 0 | LP 0)
                wait until ((RDYin = '0') and (not CLOCK'stable)); -- DES ends bus cycle
                LOAD_DATA("00000000000001110000000000000000");    -- FROM_LP (NODE 7 | LP 0)
                wait until ((RDYin = '0') and (not CLOCK'stable)); -- DES ends bus cycle
                LOAD_DATA("00000000000000000000000000001111");    -- TIME_TAG (15 units)
                wait until ((RDYin = '0') and (not CLOCK'stable)); -- DES ends bus cycle
                LOAD_DATA("01010101010101010101010101010101");    -- MEMR_PTR (CPU mem. addr)
                wait until (RDYin = '0');                         -- DES ends bus cycle
                DO_POST_MSG := true;
            end if;

            -- Test Post_Msg
            -- ************************************************************
            if DO_POST_MSG then
                DO_POST_MSG := false;
                OPERATE;
                if (BUSYin = '0') then
                    LOAD_GET_EVENT := true;
                    wait until (not CLOCK'stable);                -- synch for next test
                end if;
            end if;

            -- ************************************************************
            -- Load opcode and operands to run GET_EVENT
            --   All input arcs must have message in CAM
            --   ARCS_IN_STAT must be verified as good
            --   Retrieve and send CPU a "real" event
            --   Request event but return "wait" message
            if (LOAD_GET_EVENT and (BUSYin = '0')) then           -- load GET_EVENT test

                -- Satisfy message on all input arcs requirement for lp: 0 node: 0
                LOAD_GET_EVENT := false;
                wait until not CLOCK'stable;
                LOAD_INSTR("00100000000000000000000000000000");   -- Post_Msg opcode
                wait until ((RDYin = '0') and (not CLOCK'stable)); -- DES ends bus cycle
                LOAD_DATA("00000000000000000000000000000000");    -- TO_LP (NODE 0 | LP 0)
                wait until ((RDYin = '0') and (not CLOCK'stable)); -- DES ends bus cycle
                LOAD_DATA("00000000000000000000000000000000");    -- FROM_LP (NODE 0 | LP 0)
                wait until ((RDYin = '0') and (not CLOCK'stable)); -- DES ends bus cycle
                LOAD_DATA("00000000000000000000000000001010");    -- TIME_TAG (10 units)
                wait until ((RDYin = '0') and (not CLOCK'stable)); -- DES ends bus cycle
                LOAD_DATA("01001111010011110100111101001111");    -- MEMR_PTR (CPU mem. addr)
                wait until (RDYin = '0');                         -- DES ends bus cycle
                DO_POST_MSG := true;

                if DO_POST_MSG then
                    DO_POST_MSG := false;
                    OPERATE;
                    if (BUSYin = '0') then
                        LOAD_GET_EVENT := true;
                        wait until (not CLOCK'stable);
                    end if;
                end if;

                -- Test GET_EVENT
                -- ********************************************************
                if (LOAD_GET_EVENT and (BUSYin = '0')) then
                    LOAD_GET_EVENT := false;
                    wait until not CLOCK'stable;
                    LOAD_INSTR("01000000000000000000000000000000");   -- Get_Event opcode
                    wait until ((RDYin = '0') and (not CLOCK'stable));
                    LOAD_DATA("00000000000000000000000000000000");    -- TO_LP (NODE 0 | LP 0)
                    wait until (RDYin = '0');

                    -- CPU releases bus and receives NEXT_EVENT from DES Coprocessor
                    DATAout <= "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ";

                    EVENT_Loop:
                    while ((RUN = '1') and (BUSYin = '1')) loop
                        M_IOout <= '0' after ODEL;                    -- Read event
                        WRout   <= '0' after ODEL;
                        A15out  <= '1' after ODEL;
                        wait for ODEL;
                    end loop EVENT_Loop;

                    LOAD_POST_EVENT := true;
                end if;
            end if;  -- if LOAD_GET_EVENT

            -- Load opcode and operands to run Post_Event
            if (LOAD_POST_EVENT and (BUSYin = '0')) then          -- load Post_Event
                LOAD_POST_EVENT := false;
                wait until not CLOCK'stable;

                LOAD_INSTR("01100000000000000000000000000000");   -- Post_Event opcode
                wait until ((RDYin = '0') and (not CLOCK'stable));

                LOAD_DATA("00000000000000000000000000000000");    -- node 0 | LP 0
                wait until ((RDYin = '0') and (not CLOCK'stable)); -- "FROM_LP"

                LOAD_DATA("01110111011101110111011101110111");    -- Memory Pointer
                wait until ((RDYin = '0') and (not CLOCK'stable)); -- to EVENT

                LOAD_DATA("00000000000001000000000000000000");    -- node 4 | LP 0
                wait until (RDYin = '0');                         -- "TO_LP"

                DO_POST_EVENT := true;

                -- Test Post_Event
                -- ********************************************************
                if DO_POST_EVENT then
                    OPERATE;
                    if (BUSYin = '0') then
                        DO_POST_EVENT := false;
                        wait until (not CLOCK'stable);            -- next test synch
                    end if;
                end if;

                -- DO_NOTHING (included for extra time in simulation)
                BUSY_LOOP2: loop
                    wait until (not CLOCK'stable);                -- next test synch
                    exit BUSY_LOOP2 when (RUN = '0');
                end loop BUSY_LOOP2;

            end if;  -- if LOAD_POST_EVENT

        end if;  -- CHECK FOR STATE TRANSITIONS!

    end process ONE_LP;

end BEHAVIOR;
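Judging from the operand comments in the listing (for example `IN_NODE 7 | IN_LP 0` loaded as a word with `111` in the upper half), the 32-bit LP address operands appear to carry the node number in the upper 16 bits and the LP number in the lower 16 bits. The sketch below illustrates that reading only. The names `DWORD_BV`, `TO_BV`, and `PACK` are hypothetical, and the predefined `bit_vector` type stands in for the thesis `DWORD` type, whose definition is not shown in this section.

```vhdl
-- Illustrative sketch only; the field layout is inferred from the listing's
-- comments, and all names here are assumptions rather than thesis code.
entity operand_pack_sketch is
end operand_pack_sketch;

architecture sketch of operand_pack_sketch is
    subtype DWORD_BV is bit_vector(31 downto 0);   -- stand-in for DWORD

    -- Convert a non-negative integer to a bit_vector of the given width.
    -- Bits above WIDTH are silently dropped.
    function TO_BV (VALUE : natural; WIDTH : positive) return bit_vector is
        variable RESULT : bit_vector(WIDTH - 1 downto 0) := (others => '0');
        variable REST   : natural := VALUE;
    begin
        for I in 0 to WIDTH - 1 loop
            if (REST mod 2) = 1 then
                RESULT(I) := '1';
            end if;
            REST := REST / 2;
        end loop;
        return RESULT;
    end TO_BV;

    -- Pack the node number into the upper half and the LP number into
    -- the lower half of a 32-bit operand.
    function PACK (NODE, LP : natural) return DWORD_BV is
    begin
        return TO_BV(NODE, 16) & TO_BV(LP, 16);
    end PACK;
begin
    check: process
    begin
        -- These literals are the operands used in the driver listing.
        assert PACK(7, 0) = "00000000000001110000000000000000"
            report "node 7 | LP 0 mismatch" severity failure;
        assert PACK(3, 0) = "00000000000000110000000000000000"
            report "node 3 | LP 0 mismatch" severity failure;
        assert PACK(4, 0) = "00000000000001000000000000000000"
            report "node 4 | LP 0 mismatch" severity failure;
        wait;
    end process check;
end sketch;
```

A helper of this kind would let a multiple-LP extension of the driver build its operands from integers instead of hand-written 32-character literals.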